Abstract

We calculate the probability distributions for the number of occurrences $n$ of a given
$l$-letter word in a random string of $k$ letters. We consider the case in which the letters
of the strings belong to an $r$-letter alphabet. Analytical expressions for the distribution
are known for the asymptotic regimes (i) $k \gg r^l \gg 1$ (Gaussian) and (ii) $k, l \to \infty$
such that $k/r^l$ is finite (compound Poisson). However, it is known that these distributions do
not work well in the intermediate regime $k \gtrsim r^l \gtrsim 1$. We show that the problem of
calculating the string matching probability can be cast into a problem of determining the
configurational partition function of a 1d lattice gas with interacting particles, such that
the string matching probability distribution becomes the grand-partition sum of the lattice
gas, with the number of particles corresponding to the number of matches on the string.
Using this analogy, we perform a virial expansion of the effective equation of state and
thereby obtain the probability distribution function. Our result reproduces the behavior
of the matching distribution in all regimes, i.e. the asymptotic as well as the intermediate
regimes, rather well. We are also able to show analytically how the limiting distributions
arise. Our analysis builds on the observation that the effective interactions between the
particles consist of a relatively strong core of size $l$, the word length, followed by a weak,
exponentially decaying tail, whose overall strength decreases with increasing $l$. We find
that the asymptotic regimes correspond to the case where the tail of the interactions can be
neglected, while in the intermediate regime the effects of the tail need to be incorporated
into the analysis. This is ultimately responsible for the failure of the asymptotic distributions
in this regime. Our results are readily generalized to the case where the random strings are
generated by more complicated stochastic processes, such as a non-uniform letter probability
distribution or Markov chains. We show that by varying the parameters of the stochastic
process, the tails of the effective interactions can be made even more dominant, thus
rendering the asymptotic approximations less accurate in such a regime.

PACS Nos: 02.10.Ox, 05.70.Ce, 02.50.-r
1 Introduction

The problem of determining the probability of encountering (matching) a given string of
length $l$ in another string of length $k$, whose letters have been drawn randomly from an
alphabet of $r$ letters, has a variety of applications ranging from designing fast algorithms
for pattern searching to problems in genetics, such as assessing the frequency of
occurrence of DNA segments or the likelihood that certain DNA segments align. In each of
these cases, the likelihood estimates for random sequences can be used as a benchmark
against which one can evaluate the statistical significance of actually observed events.

The problem is non-trivial, because of the possibility of overlapping occurrences in 
the string, which introduce correlations that need to be dealt with. Guibas and Odlyzko
derived the moment generating functions associated with the probability for
not encountering a given set of words in a random string, whose letters were distributed 
independently and identically. The resulting distributions turn out to depend on a set of 
correlation functions that capture the overlap properties of the words with each other. 

Building on the work of Guibas and Odlyzko, several authors have studied the probability
distribution for the number of occurrences $n$ of a given $l$-letter word in a random
string of $k$ letters, under various assumptions on the distribution of random letters.
The case where the letters of the random string are independently and identically
distributed (i.i.d.) was treated by Fudos et al. [13], whereas the case where the letter
distribution follows the steady-state distribution of a Markov process has been
investigated by several authors. All of these results have been obtained for asymptotic
regimes ($k$ large, plus various assumptions on the length $l$ of the word), where tools of
statistics such as the central limit theorem, the theory of large deviations, or (compound)
Poisson approximations for rare events are applicable.

The regimes of applicability can be difficult to identify, however. It has been noted
that, even in the case of i.i.d. letters, when the length $l$ of the word to be matched is fixed
and the length of the random string is assumed to be large, the most accurate approximation
to choose (Gaussian or compound Poisson) still depends on the word itself that is being
matched.

It is therefore desirable to come up with a single, explicit analytical expression for the 
probability distribution that is generally valid, and to obtain the asymptotic expressions, 
mentioned above, as special cases by taking certain limits. This is what we set out to 
do in this article. Besides the obvious advantage of having a single description, such 
an approach will naturally identify the regimes of application of the various asymptotic 
approximations, while also pointing out when and how they fail. 

It turns out that all of these issues are present even in the simplest case where the 
letters of the random string are uniformly and independently distributed. For the sake of 
simplicity and clarity of presentation, we will perform the analysis for this case. However, 
we will point out in detail how these results carry over to the more general case of random 
letter distributions. 

Our approach to this problem, which appears to be novel, can be summarized as
follows: We first show that the problem of calculating the probability distribution for
the number of occurrences of a given $l$-letter word in a random string of $k$ letters
can be rigorously mapped into the problem of calculating the configurational part of the
grand-canonical partition function of a 1d lattice gas. In this mapping the number of
particles corresponds to the number of occurrences, the "volume" of the gas is the length of
the random string, and the correlations between subsequent occurrences turn into pairwise
interactions whose nature depends on properties of the word to be matched. It turns out
that common to all interactions is a relatively strong and short-ranged segment of range
$l$, followed by a weak and exponentially decaying tail.

With the help of the lattice gas analogy, and by using techniques of liquid theory, such 
as the virial expansion, we are able to obtain an analytical expression for the probability 
distribution that reproduces the known asymptotic limits. We show how the distribu- 
tion crosses over into the asymptotic forms of the distribution, and thereby expose the 
conditions required for these limits to be applicable. 

More importantly, our method allows us to analytically treat the intermediate regime
of moderate string lengths, $k \gtrsim r^l \gtrsim 1$, as well. This regime is most relevant for
biological applications and turns out to be the hardest one to tackle analytically, since in
this regime the effects of the tail are strong and need to be kept in the analysis. This is
also the reason behind the deviations of the asymptotic forms from the actual distribution,
since, as we will show, these distributions are obtained by neglecting the tail of the
interactions. These deviations become more pronounced for short words and a small number of
letters in the alphabet, i.e. small $l$ and $r$, respectively.

Our results are readily generalized to the broader class of letter distributions, such as 
non-uniformly distributed letters or letters generated by a Markov process. Such distri- 
butions give rise to a broader class of effective interactions. In particular, it turns out 
that these interactions can have stronger tails than can be achieved by a uniform letter 
distribution. This potentially renders our method of approach even more relevant to such 
letter distributions. 

We would also like to note that our approach is similar in spirit to recent attempts at
solving combinatorial problems, such as the k-SAT problem, using ideas
borrowed from statistical mechanics.

The article is organized as follows: In Section 2 we introduce our notation and
formalism, rederiving in this setting some of the relevant and known results. In Section
3 we establish the partition function analogy. We derive and study the properties of
the effective particle interactions and then set up a virial expansion for the "equation
of state". From the virial expansion we obtain the $n$-particle partition function, which
in this analogy corresponds to the $n$-match probability distribution function. We show
how the various known limits arise and discuss the underlying assumptions. Section 4
discusses the generalization and implications of our approach to the more general class of
letter distributions studied in the literature, and we discuss our results in Section 5.
2 The matching probabilities

In this section we derive some of the known expressions for the matching and $n$-match
probabilities, i.e. the probability of at least one and of precisely $n$ occurrences of a
given word, respectively. Besides providing a review of the relevant results, the main
purpose of this section is to introduce our notation and provide the setting for the
statistical mechanics approach to be taken up in the following section.

2.1 Definitions 

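Assume that $x$ and $y$ are variables that take values from an $r$-letter alphabet such that
$x, y \in \{0, \ldots, r-1\}$. Let $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_l)$ and
$\mathbf{y} = (y_1, y_2, y_3, \ldots, y_l)$ be two strings of $l$ letters. Define the match
indicator function $\Phi(\mathbf{x}, \mathbf{y})$ as

$$ \Phi(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{l} \delta(x_t, y_t), \tag{2.1} $$

so that we have

$$ \Phi(\mathbf{x}, \mathbf{y}) = \begin{cases} 1, & \mathbf{x} = \mathbf{y}, \\ 0, & \text{otherwise.} \end{cases} \tag{2.2} $$

Let $\mathbf{y} = (y_1, y_2, \ldots, y_k)$ be a string of length $k \ge l$ and denote by
$\mathbf{y}_{a,l} = (y_{a+1}, y_{a+2}, \ldots, y_{a+l})$ the substring of length $l$ starting at
position $a$, $a = 0, 1, \ldots, k - l$. Furthermore, let

$$ f_a(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}, \mathbf{y}_{a,l}). \tag{2.3} $$

We have

$$ f_a(\mathbf{x}, \mathbf{y}) = \begin{cases} 1, & \mathbf{x} = \mathbf{y}_{a,l}, \\ 0, & \text{otherwise.} \end{cases} \tag{2.4} $$

In other words, $f_a(\mathbf{x}, \mathbf{y}) = 1$ if and only if $\mathbf{x}$ matches $\mathbf{y}$ at position $a$, and zero otherwise.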
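The indicator of Eq. (2.4) translates directly into code. The following Python sketch is for illustration only: the names `match_indicators` and `count_matches` are not from the text, and letters are assumed to be represented as integers in {0, ..., r-1} stored in tuples.

```python
def match_indicators(x, y):
    """Return [f_0, ..., f_{k-l}], where f_a = 1 if x matches y at offset a."""
    l, k = len(x), len(y)
    # y[a:a+l] is the substring y_{a+1}, ..., y_{a+l} in the 1-based notation of the text
    return [int(tuple(y[a:a + l]) == tuple(x)) for a in range(k - l + 1)]


def count_matches(x, y):
    """Number of (possibly overlapping) occurrences of x in y."""
    return sum(match_indicators(x, y))


if __name__ == "__main__":
    x = (0, 0, 1)
    y = (0, 0, 1, 0, 0, 1, 1)
    print(match_indicators(x, y))
    print(count_matches(x, y))
```

Overlapping occurrences are counted separately, which is what makes the statistics of the match count non-trivial.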

2.2 The matching probability

Define $p(m; \mathbf{x})$ to be the probability that a given word $\mathbf{x}$ of length $l$ is
contained at least once in a randomly drawn string $\mathbf{y}$ of length $m + l$. We will refer
to this as the matching probability. Let $f_M(\mathbf{x}, \mathbf{y})$ be the function that takes
on the value one if the $k$-string $\mathbf{y}$ contains the given $l$-string $\mathbf{x}$ at least
once, and zero otherwise. Using Eq. (2.4), we can write

$$ f_M(\mathbf{x}, \mathbf{y}) = 1 - \prod_{a=0}^{k-l} \left[ 1 - f_a(\mathbf{x}, \mathbf{y}) \right]. \tag{2.5} $$

Since $k \ge l$, it is convenient to define the excess length $m = k - l$. We thus find

$$ p(m; \mathbf{x}) = 1 - \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \prod_{a=0}^{m} \left[ 1 - f_a(\mathbf{x}, \mathbf{y}) \right], \tag{2.6} $$

where $r^{m+l}$ is the number of distinct $k = m + l$ strings of $r$ letters, and the summation is
over all such strings $\mathbf{y}$.

In earlier work, the products on the right hand side of Eq. (2.6) were expanded into a
Mayer-like sum,

$$ p(m; \mathbf{x}) = \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \sum_{a} f_a - \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \sum_{a<b} f_a f_b + \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \sum_{a<b<c} f_a f_b f_c - \cdots \tag{2.7} $$

(arguments of $f_a$ will be suppressed in what follows) and the terms in the sum were
evaluated approximately. Here we will take a different approach.
The following algebraic identity will be of use:

$$ 1 - \prod_{a=0}^{m} (1 - f_a) = \sum_{b=0}^{m} f_b \prod_{a=0}^{b-1} (1 - f_a), \tag{2.8} $$

with the convention that when $b = 0$, the product on the right hand side is set to one.
Eq. (2.8) is readily proven by induction.

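Using this identity, $p(m; \mathbf{x})$ can be written as

$$ p(m; \mathbf{x}) = \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \sum_{b=0}^{m} f_b \prod_{a=0}^{b-1} (1 - f_a). \tag{2.9} $$

Note that for any given $b$, the expression on the right hand side only involves the variables
$y_1, y_2, \ldots, y_{b+l}$. The sum over the remaining indices yields $r^{m-b}$ and we find that

$$ p(m; \mathbf{x}) = \sum_{b=0}^{m} \frac{1}{r^{b+l}} \sum_{y_1 \cdots y_{b+l}} f_b \prod_{a=0}^{b-1} (1 - f_a). \tag{2.10} $$

Defining the correlator $d(b; \mathbf{x})$ as

$$ d(b; \mathbf{x}) = \sum_{y_1 \cdots y_{b+l}} f_b \prod_{a=0}^{b-1} (1 - f_a), \tag{2.11} $$

$p(m; \mathbf{x})$ can therefore be written as

$$ p(m; \mathbf{x}) = \sum_{b=0}^{m} \frac{1}{r^{b+l}}\, d(b; \mathbf{x}). \tag{2.12} $$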

We can obtain a recursion relation for $d(b; \mathbf{x})$ by factoring out the $a = 0$ term in
Eq. (2.11),

$$ d(b; \mathbf{x}) = \sum_{y_1 \cdots y_{b+l}} f_b \prod_{a=1}^{b-1} (1 - f_a) - \sum_{y_1 \cdots y_{b+l}} f_0 f_b \prod_{a=1}^{b-1} (1 - f_a). \tag{2.13} $$

The argument of the first sum does not contain the variable $y_1$, so that the sum over $y_1$
contributes a factor $r$, while the sum over the remaining variables yields
$d(b - 1; \mathbf{x})$. Thus,

$$ d(b; \mathbf{x}) = r\, d(b - 1; \mathbf{x}) - h(b; \mathbf{x}), \tag{2.14} $$

with the correlator $h$ defined as

$$ h(b; \mathbf{x}) = \sum_{y_1 \cdots y_{b+l}} f_0 f_b \prod_{a=1}^{b-1} (1 - f_a). \tag{2.15} $$

Note that for $m = 0$

$$ p(0; \mathbf{x}) = \frac{1}{r^l}. \tag{2.16} $$

Comparing with Eq. (2.12) this implies that

$$ d(0; \mathbf{x}) = 1. \tag{2.17} $$

Since there are no constraints on $h(0; \mathbf{x})$, we will define $h(0; \mathbf{x}) = 0$.

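We next seek a recursion relation for $h$. Using the identity, Eq. (2.8), we find from
Eq. (2.15)

$$ h(b; \mathbf{x}) = \sum_{y_1 \cdots y_{b+l}} f_0 f_b - \sum_{y_1 \cdots y_{b+l}} \sum_{c=1}^{b-1} f_0 \left[ \prod_{a=1}^{c-1} (1 - f_a) \right] f_c f_b. \tag{2.18} $$

Recall from the definition of $f_b(\mathbf{x}, \mathbf{y})$ that $f_b$ is a product of Kronecker
deltas, Eqs. (2.1) and (2.3). The Kronecker deltas enforce a transitive relation between their
arguments: a match at position $c$ fixes the letters $y_{c+1}, \ldots, y_{c+l}$ to those of
$\mathbf{x}$, so that the product $f_c f_b$ can be decoupled from the factors to its left by
introducing an auxiliary set of variables
$\tilde{\mathbf{y}} = (\tilde{y}_{c+1}, \ldots, \tilde{y}_{b+l})$ over which a sum is to be
performed. Thus the second term of Eq. (2.18) can be rewritten as

$$ \sum_{y_1 \cdots y_{b+l}} \sum_{c=1}^{b-1} f_0 \left[ \prod_{a=1}^{c-1} (1 - f_a) \right] f_c f_b = \sum_{c=1}^{b-1} \left\{ \sum_{y_1 \cdots y_{c+l}} f_0 \left[ \prod_{a=1}^{c-1} (1 - f_a) \right] f_c \right\} \left\{ \sum_{\tilde{y}_{c+1} \cdots \tilde{y}_{b+l}} f_c(\mathbf{x}, \tilde{\mathbf{y}})\, f_b(\mathbf{x}, \tilde{\mathbf{y}}) \right\}. \tag{2.19} $$

Defining the correlator $C(b; \mathbf{x})$ as

$$ C(b; \mathbf{x}) = \sum_{y_1 \cdots y_{b+l}} f_0(\mathbf{x}, \mathbf{y})\, f_b(\mathbf{x}, \mathbf{y}), \tag{2.20} $$

and substituting Eq. (2.19) into Eq. (2.18), we find

$$ h(b; \mathbf{x}) = C(b; \mathbf{x}) - \sum_{a=1}^{b-1} h(a; \mathbf{x})\, C(b - a; \mathbf{x}). \tag{2.21} $$

Using Eq. (2.20), it can be easily shown that for $b \ge l$

$$ C(b; \mathbf{x}) = r^{b-l}. \tag{2.22} $$

Denoting the values of $C(b; \mathbf{x})$ for $b < l$ by $c_b(\mathbf{x})$, we have

$$ c_b(\mathbf{x}) = \sum_{y_1 \cdots y_{b+l}} f_0(\mathbf{x}, \mathbf{y})\, f_b(\mathbf{x}, \mathbf{y}), \qquad 0 < b < l, \tag{2.23} $$

and thus

$$ c_b(\mathbf{x}) = \prod_{t=1}^{l-b} \delta(x_t, x_{t+b}). \tag{2.24} $$

As evident from Eq. (2.23), the set of indices $c_b(\mathbf{x}) \in \{0, 1\}$, with
$b = 1, 2, \ldots, l - 1$, measure the auto-correlations of $\mathbf{x}$. They are referred to as
the bit-vector $\mathbf{c} = (c_1, c_2, \ldots, c_{l-1})$ associated with $\mathbf{x}$, and were
studied by Harborth and later in considerable detail by Guibas and Odlyzko.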
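The recursions (2.14) and (2.21), together with Eqs. (2.22) and (2.23), are enough to evaluate the matching probability of Eq. (2.12) exactly. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, words are tuples of integers, and $C(0) = 0$ is adopted as a convention alongside $h(0) = 0$. The brute-force enumeration is included as a cross-check for small strings.

```python
from fractions import Fraction
from itertools import product


def bit_vector(x):
    """Autocorrelation bits c_1..c_{l-1}: c_b = 1 iff x overlaps itself at shift b."""
    l = len(x)
    return [int(x[:l - b] == x[b:]) for b in range(1, l)]


def correlators(x, r, bmax):
    """Lists C, h, d indexed by b = 0..bmax, from Eqs. (2.22)-(2.23), (2.21), (2.14)."""
    l, c = len(x), bit_vector(x)
    C = [0] * (bmax + 1)          # C[0] = 0 is a convention (assumption of this sketch)
    h = [0] * (bmax + 1)          # h(0) = 0 as defined in the text
    d = [1] + [0] * bmax          # d(0) = 1, Eq. (2.17)
    for b in range(1, bmax + 1):
        C[b] = c[b - 1] if b < l else r ** (b - l)
        h[b] = C[b] - sum(h[a] * C[b - a] for a in range(1, b))
        d[b] = r * d[b - 1] - h[b]
    return C, h, d


def matching_probability(x, r, m):
    """p(m; x) from Eq. (2.12), as an exact fraction."""
    l = len(x)
    _, _, d = correlators(x, r, m)
    return sum(Fraction(d[b], r ** (b + l)) for b in range(m + 1))


def matching_probability_brute_force(x, r, m):
    """Enumerate all r^(m+l) strings; feasible only for short strings."""
    l, k = len(x), m + len(x)
    hits = sum(
        any(y[a:a + l] == x for a in range(m + 1))
        for y in product(range(r), repeat=k)
    )
    return Fraction(hits, r ** k)


if __name__ == "__main__":
    word = (0, 1, 0)
    for m in range(6):
        print(m, matching_probability(word, 2, m),
              matching_probability_brute_force(word, 2, m))
```

The recursion costs $O(m^2)$ integer operations, whereas the enumeration grows as $r^{m+l}$.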

2.3 Bit-vectors

From the definition, Eq. (2.23), it is clear that $c_b = 1$ if and only if the string
$\mathbf{x}$ shifted by an amount $b$ relative to itself coincides on the overlapping part.
Conversely, $c_b = 0$ if the overlapping part does not coincide. It turns out that the set of
possible words $\mathbf{x}$ of length $l$ is partitioned into equivalence classes with respect
to their bit-vectors $\mathbf{c} = (c_1, c_2, \ldots, c_{l-1})$ and that the possible classes are
independent of the number of letters $r$ (as long as $r \ge 2$). Tables 1 and 2 list the sets of
possible bit-vectors up to $l = 8$ along with the number of elements in their respective
equivalence classes for $r = 2, 3, 4$.

We see that the definition of $c_b(\mathbf{x})$ imposes strong conditions on the possible values
of the $l - 1$ bits of a bit-vector, and it turns out that the resulting bit-vectors have
interesting properties, of which we will mention only the most relevant ones.

For example, if $c_p = c_q = 1$ with $p < q$, this implies that $c_t = 1$ for all $t$ of the form
$t = p + i(q - p)$ with $i = 0, 1, 2, \ldots$ and $t < l$. This is referred to as the forward
propagation rule. In particular, $c_p = 1$ implies that $c_{ip} = 1$ for all $i = 1, 2, \ldots$
such that $ip < l$. The latter result shows that $p$ can be considered as a period. We define the
fundamental period of a string $\mathbf{x}$ to be the smallest $p$, with $0 < p < l$, such that
$c_p = 1$. If $\mathbf{x}$ is such that its bit-vector is $000\cdots$ (all zeroes), we define
its fundamental period to be $l$.
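The partition of words into bit-vector classes can be checked by direct enumeration for short words. The sketch below is illustrative (the helper names are not from the text); it computes $c_b$ from the self-overlap condition described above and tallies the class sizes, which can be compared against the entries of Tables 1 and 2.

```python
from collections import Counter
from itertools import product


def bit_vector(x):
    """Bit-vector c_1 ... c_{l-1} of the word x, returned as a string of 0s and 1s."""
    l = len(x)
    # c_b = 1 iff the first l-b letters of x coincide with the last l-b letters
    return "".join("1" if x[:l - b] == x[b:] else "0" for b in range(1, l))


def class_sizes(l, r):
    """Number of words of length l over an r-letter alphabet in each bit-vector class."""
    return Counter(bit_vector(x) for x in product(range(r), repeat=l))


if __name__ == "__main__":
    for bits, size in sorted(class_sizes(4, 2).items()):
        print(bits, size)
```

Since the enumeration visits all $r^l$ words, it is practical only for the small values of $l$ and $r$ considered in the tables.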

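2.4 The n-match probability

Denote by $p(n; m, \mathbf{x})$ the probability that a randomly drawn $k$-string $\mathbf{y}$
contains a given $l$-string $\mathbf{x}$ precisely $n$ times. We will refer to this as the
$n$-match probability.

If we let the random variable $N(\mathbf{x}, \mathbf{y})$ denote the number of occurrences of
$\mathbf{x}$ in $\mathbf{y}$, it follows that

$$ N(\mathbf{x}, \mathbf{y}) = \sum_{a=0}^{m} f_a. \tag{2.25} $$

Thus the average number of matches $\langle n \rangle$ and its second moment
$\langle n^2 \rangle$ are readily obtained as

$$ \langle n \rangle = \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \sum_{a=0}^{m} f_a \tag{2.26} $$

and

$$ \langle n^2 \rangle = \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \sum_{a,b=0}^{m} f_a f_b. \tag{2.27} $$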
y a<b 
c          r = 2    r = 3    r = 4

0              2        6       12
1              2        3        4

00             4       18       48
01             2        6       12
11             2        3        4

000            6       48      180
001            6       24       60
010            2        6       12
111            2        3        4

0000          12      144      720
0001          10       66      228
0010           4       18       48
0011           2        6       12
0101           2        6       12
1111           2        3        4

00000         20      414     2832
00001         22      210      948
00010          6       48      180
00011          6       24       60
00100          4       18       48
00101          2        6       12
01010          2        6       12
11111          2        3        4

Table 1: Equivalence classes of bit-vectors and their number of elements. The table
shows the bit-vectors $\mathbf{c} = (c_1, c_2, \ldots, c_{l-1})$ associated with strings of
length $l = 2, 3, 4, 5$ and $6$ and the number of elements in these equivalence classes for
$r = 2, 3$ and $4$ letter alphabets.



c            r = 2    r = 3    r = 4

000000          40     1242    11328
000001          38      606     3732
000010          16      162      768
000011          12       72      240
000100           8       54      192
000101           2       12       36
000111           2        6       12
001001           6       24       60
010101           2        6       12
111111           2        3        4

0000000         74     3678    45132
0000001         82     1866    15108
0000010         26      462     3012
0000011         22      210      948
0000100         16      162      768
0000101          8       54      192
0000111          6       24       60
0001000          6       48      180
0001001          6       24       60
0010010          4       18       48
0010011          2        6       12
0101010          2        6       12
1111111          2        3        4

Table 2: Equivalence classes of bit-vectors and their number of elements. The table
shows the bit-vectors $\mathbf{c} = (c_1, c_2, \ldots, c_{l-1})$ associated with strings of
length $l = 7$ and $8$ and the number of elements in these equivalence classes for
$r = 2, 3$ and $4$ letter alphabets.
The latter expression can be worked out using Eqs. (2.20), (2.22) and (2.23) and we
find for the variance

$$ \langle n^2 \rangle - \langle n \rangle^2 = \frac{m+1}{r^l} + \frac{1}{r^{2l}} \left[ (m+1)(1 - 2l) + l(l-1) \right] + \frac{2}{r^l} \sum_{b=1}^{l-1} (m + 1 - b)\, \frac{c_b(\mathbf{x})}{r^b}. \tag{2.28} $$

Eq. (2.28) is a special case of a result due to Kleffe and Borodovsky [29], who considered
general distributions of random letters.
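As a sanity check, the variance of Eq. (2.28) can be compared with a direct enumeration over all strings. The sketch below is illustrative, with hypothetical function names; it assumes uniformly distributed i.i.d. letters and $m + 1 \ge l$, so that every shift $b < l$ appearing in the sum can actually occur on the string.

```python
from fractions import Fraction
from itertools import product


def bit_vector(x):
    l = len(x)
    return [int(x[:l - b] == x[b:]) for b in range(1, l)]


def variance_formula(x, r, m):
    """Variance of the match count according to Eq. (2.28); assumes m + 1 >= l."""
    l, c, M = len(x), bit_vector(x), m + 1
    overlap = sum(Fraction((M - b) * c[b - 1], r ** b) for b in range(1, l))
    return (Fraction(M, r ** l)
            + Fraction(M * (1 - 2 * l) + l * (l - 1), r ** (2 * l))
            + Fraction(2, r ** l) * overlap)


def variance_brute_force(x, r, m):
    """Variance of the match count from enumerating all r^(m+l) strings."""
    l, k = len(x), m + len(x)
    counts = [sum(y[a:a + l] == x for a in range(m + 1))
              for y in product(range(r), repeat=k)]
    mean = Fraction(sum(counts), r ** k)
    second = Fraction(sum(n * n for n in counts), r ** k)
    return second - mean ** 2


if __name__ == "__main__":
    for word in [(0, 0, 1), (0, 1, 0), (1, 1, 1)]:
        print(word, variance_formula(word, 2, 4), variance_brute_force(word, 2, 4))
```

The last term of the formula is the only place where the word enters, through its bit-vector, which is why self-overlapping words have a larger variance.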

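Let $I_{n,m}(a_1, a_2, \ldots, a_n; \mathbf{x}, \mathbf{y})$ be the function that takes on the
value 1 when exactly $n$ matches occur, located at positions $a_1, a_2, \ldots, a_n$ with
$0 \le a_1 < a_2 < \cdots < a_n \le m$, and zero otherwise,

$$ I_{n,m}(a_1, a_2, \ldots, a_n; \mathbf{x}, \mathbf{y}) = \left[ \prod_{i_1=0}^{a_1-1} (1 - f_{i_1}) \right] f_{a_1} \left[ \prod_{i_2=a_1+1}^{a_2-1} (1 - f_{i_2}) \right] f_{a_2} \cdots \left[ \prod_{i_n=a_{n-1}+1}^{a_n-1} (1 - f_{i_n}) \right] f_{a_n} \left[ \prod_{i_{n+1}=a_n+1}^{m} (1 - f_{i_{n+1}}) \right]. \tag{2.29} $$

In terms of $I_{n,m}(a_1, a_2, \ldots, a_n; \mathbf{x}, \mathbf{y})$ we can write
$p(n; m, \mathbf{x})$ as

$$ p(n; m, \mathbf{x}) = \frac{1}{r^{m+l}} \sum_{\mathbf{y}} \sum_{a_1 < a_2 < \cdots < a_n} I_{n,m}(a_1, a_2, \ldots, a_n; \mathbf{x}, \mathbf{y}). \tag{2.30} $$

Analogously to the reasoning leading from Eq. (2.18) to Eq. (2.19), it can be shown
that the sum over $\mathbf{y}$ of $I_{n,m}(a_1, \ldots, a_n; \mathbf{x}, \mathbf{y})$ factorizes as

$$ \sum_{\mathbf{y}} I_{n,m}(a_1, a_2, \ldots, a_n; \mathbf{x}, \mathbf{y}) = d(a_1) \left[ \prod_{i=1}^{n-1} h(a_{i+1} - a_i) \right] d(m - a_n), \tag{2.31} $$

where $d$ and $h$ are as defined in Eqs. (2.11) and (2.15). Thus $p(n; m, \mathbf{x})$ becomes

$$ p(n; m, \mathbf{c}) = \frac{1}{r^{m+l}} \sum_{a_1 < a_2 < \cdots < a_n} d(a_1) \left[ \prod_{i=1}^{n-1} h(a_{i+1} - a_i) \right] d(m - a_n), \tag{2.32} $$

where we have changed the argument of the distribution function to $p(n; m, \mathbf{c})$, to
emphasize that the distribution really depends on the bit-vector $\mathbf{c}$ only. It is
readily seen that the sum over the positions is an $n + 1$ fold convolution of $d$ and $h$. To
simplify the results, as well as to be able to obtain asymptotic expressions, we next
introduce generating functions.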
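The convolution structure of Eq. (2.32) lends itself to a simple iterative evaluation. The following Python sketch is illustrative only (function names are hypothetical, words are tuples of integers, and $C(0) = 0$ is used as a convention); it builds $d$ and $h$ from the recursions (2.14) and (2.21), carries out the nested sum over positions one match at a time, and compares the result with a brute-force count for a short string.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def correlators(x, r, bmax):
    """h(b) and d(b) for b = 0..bmax from the recursions (2.21) and (2.14)."""
    l = len(x)
    c = [int(x[:l - b] == x[b:]) for b in range(1, l)]
    C, h, d = [0] * (bmax + 1), [0] * (bmax + 1), [1] + [0] * bmax
    for b in range(1, bmax + 1):
        C[b] = c[b - 1] if b < l else r ** (b - l)
        h[b] = C[b] - sum(h[a] * C[b - a] for a in range(1, b))
        d[b] = r * d[b - 1] - h[b]
    return h, d


def n_match_probability(x, r, m, n):
    """p(n; m, c) of Eq. (2.32) for n >= 1, as an exact fraction."""
    l = len(x)
    h, d = correlators(x, r, m)
    # w[a]: summed weight of all placements whose most recent match sits at position a
    w = d[:]
    for _ in range(n - 1):
        w = [sum(w[a] * h[b - a] for a in range(b)) for b in range(m + 1)]
    total = sum(w[a] * d[m - a] for a in range(m + 1))
    return Fraction(total, r ** (m + l))


def n_match_brute_force(x, r, m):
    """Exact distribution of the match count by enumeration (small strings only)."""
    l, k = len(x), m + len(x)
    counts = Counter(sum(y[a:a + l] == x for a in range(m + 1))
                     for y in product(range(r), repeat=k))
    return {n: Fraction(cnt, r ** k) for n, cnt in counts.items()}


if __name__ == "__main__":
    word, r, m = (0, 1, 0), 2, 5
    exact = n_match_brute_force(word, r, m)
    for n in (1, 2, 3):
        print(n, n_match_probability(word, r, m, n), exact.get(n, 0))
```

Each additional match costs one discrete convolution, so the evaluation takes $O(n m^2)$ operations.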



2.5 Generating functions 

Define the generating function $g(z)$ associated with a sequence $g(b)$ by

$$ g(z) = \sum_{b=0}^{\infty} z^b g(b). \tag{2.33} $$



M. Mungan String Matching and Id Lattice Gases (DRAFT, August 25, 2005) 9 



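From Eqs. (2.22) and (2.23) we find

$$ C(z; \mathbf{x}) = c(z; \mathbf{x}) + \frac{z^l}{1 - zr}, \tag{2.34} $$

where $c(z; \mathbf{x})$ is a polynomial of degree $l - 1$,

$$ c(z; \mathbf{x}) = \sum_{b=1}^{l-1} z^b c_b(\mathbf{x}). \tag{2.35} $$

It is useful to also define the polynomial of degree $l$, $\lambda(z; \mathbf{c})$, as

$$ \lambda(z; \mathbf{c}) = z^l + r^l (1 - z) \left[ 1 + c(z/r; \mathbf{x}) \right]. \tag{2.36} $$

Using Eqs. (2.14), (2.21), (2.34) and (2.36), we see that the generating functions of $h$
and $d$ are given in terms of $c(z/r; \mathbf{x})$ as

$$ h(z/r; \mathbf{c}) = 1 - \frac{1}{1 + C(z/r; \mathbf{x})} = 1 - \frac{1}{1 + c(z/r; \mathbf{x}) + \dfrac{z^l}{r^l} \dfrac{1}{1 - z}}, \tag{2.37} $$

which can be written in terms of $\lambda(z; \mathbf{c})$ as

$$ h(z/r; \mathbf{c}) = 1 - \frac{r^l (1 - z)}{\lambda(z; \mathbf{c})}, \tag{2.38} $$

and likewise,

$$ d(z/r; \mathbf{c}) = \frac{1 - h(z/r; \mathbf{c})}{1 - z} = \frac{r^l}{\lambda(z; \mathbf{c})}. \tag{2.39} $$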
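The step from the recursions to the generating functions of $h$ and $d$ can be spelled out as follows. This is a short worked derivation rather than part of the original text; it assumes the conventions $h(0) = 0$, $d(0) = 1$ and, additionally, $C(0) = 0$, so that the sum in Eq. (2.21) becomes a full convolution.

```latex
% Assumed conventions: h(0) = 0, d(0) = 1, C(0) = 0.
\begin{align*}
h(z) &= \sum_{b \ge 1} z^b \Big[ C(b) - \sum_{a=1}^{b-1} h(a)\, C(b-a) \Big]
      = C(z) - h(z)\, C(z)
  &&\Longrightarrow\quad h(z) = 1 - \frac{1}{1 + C(z)}, \\
d(z) - 1 &= \sum_{b \ge 1} z^b \big[ r\, d(b-1) - h(b) \big]
      = r z\, d(z) - h(z)
  &&\Longrightarrow\quad d(z) = \frac{1 - h(z)}{1 - r z}, \\
1 + C(z/r) &= 1 + c(z/r) + \frac{z^l}{r^l (1 - z)}
      = \frac{\lambda(z;\mathbf{c})}{r^l (1 - z)} .
\end{align*}
```

Replacing $z$ by $z/r$ in the first two lines and inserting the third reproduces the expressions for $h(z/r; \mathbf{c})$ and $d(z/r; \mathbf{c})$ in terms of $\lambda(z; \mathbf{c})$.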

The generating function for $p(m; \mathbf{c})$ thus becomes

$$ p(z; \mathbf{c}) = \frac{1}{(1 - z)\, \lambda(z; \mathbf{c})}. \tag{2.40} $$

Turning next to the generating function of $p(n; m, \mathbf{c})$,

$$ p(n; z, \mathbf{c}) = \sum_{m=0}^{\infty} z^m p(n; m, \mathbf{c}), \tag{2.41} $$

we obtain (for $n \ge 1$)

$$ p(n; z, \mathbf{c}) = \frac{1}{r^l}\, d(z/r; \mathbf{c})^2\, h(z/r; \mathbf{c})^{n-1}. \tag{2.42} $$

Eq. (2.42) is a special case of a more general result due to Regnier and Szpankowski,
who consider a broader class of letter distributions, including inhomogeneous letter
distributions as well as sequences of random letters generated from the steady state of a
Markov process.
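As an aside, we can alternatively write $p(n; z, \mathbf{c})$ in terms of $c(z/r)$ alone as

$$ p(n; z, \mathbf{c}) = r^l\, \frac{\left[ r^l (1 - z)\, c(z/r) + z^l \right]^{n-1}}{\left[ r^l (1 - z) \left( 1 + c(z/r) \right) + z^l \right]^{n+1}}, \tag{2.43} $$

or in terms of the matching probability $p(z; \mathbf{c})$ as

$$ p(n; z, \mathbf{c}) = r^l (1 - z)^2\, p(z; \mathbf{c})^2 \left[ 1 - r^l (1 - z)^2\, p(z; \mathbf{c}) \right]^{n-1}. \tag{2.44} $$

Note that from the last expression, we recover again Eq. (2.40) in terms of the generating
functions,

$$ \sum_{n=1}^{\infty} p(n; z, \mathbf{c}) = p(z; \mathbf{c}). \tag{2.45} $$

In fact, for $n = 0$ we therefore have

$$ p(0; z, \mathbf{c}) = \frac{1}{1 - z} - p(z; \mathbf{c}) = \frac{1}{1 - z} \left[ 1 - \frac{1}{\lambda(z; \mathbf{c})} \right]. \tag{2.46} $$

2.6 Asymptotic behavior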



Once the generating functions have been determined, the original functions can be
obtained by an inverse transformation defined as follows: If $f(z)$ is the generating function
associated with $f(b)$, then

$$ f(b) = \frac{1}{2\pi i} \oint_{\partial D} dz\, \frac{f(z)}{z^{b+1}}, \tag{2.47} $$

where $\partial D$ is the boundary of a domain $D$ in the complex plane that includes the origin
and on which $f(z)$ is analytic.

Note that the generating functions $h(z; \mathbf{c})$, $d(z; \mathbf{c})$, $p(z; \mathbf{c})$, and
$p(n; z, \mathbf{c})$ are all rational functions, with their denominators involving
$\lambda(z; \mathbf{c})$ or its powers, and that they all go to zero as $|z| \to \infty$. For
example, for the matching probability we have

$$ p(m, \mathbf{c}) = \frac{1}{2\pi i} \oint_{\partial D} dz\, \frac{1}{z^{m+1}}\, \frac{1}{(1 - z)\, \lambda(z; \mathbf{c})}. \tag{2.48} $$

As we will show below, the behavior of $p(m, \mathbf{c})$ for large $m$ (and likewise for $h$,
$d$ and $p(n; m, \mathbf{c})$) is dominated by the zeroes of $\lambda(z; \mathbf{c})$ that are
closest to the origin.

A numerical inspection of the zeroes of $\lambda(z; \mathbf{c})$ for $2 \le l \le 10$ and
$r = 2, 3, 4$ shows that: (1) all zeroes $z_i$ of $\lambda$ are distinct, (2) the zero of
smallest magnitude, $z_1$, is real, and greater than but near 1, and (3) all other zeroes have
magnitudes of the order $|z_i| \sim r$, $i = 2, \ldots, l$. Fig. 1 shows a plot of the zeroes
for $l = 4$, $r = 2$ and $l = 8$, $r = 2, 3, 4$. In fact, it can be rigorously proven that
$\lambda(z; \mathbf{c})$ has a single (real) zero in a circular domain centered at $z = 1$ and
of sufficiently small radius $\epsilon$.

The asymptotic behavior of $f(b)$ in Eq. (2.47) can be obtained by stretching the
contour $\partial D$ to infinity while circling around the poles of $f(z)$ without including them.
Figure 1: Root loci of the polynomial $\lambda(z; \mathbf{c})$, Eq. (2.36). The figures are for
$(l, r)$-values (starting from the top left and going clockwise) (4,2), (8,2), (8,3) and (8,4).
Plotted in each figure are the roots associated with the possible equivalence classes. For
$l = 4$ these are c = 000 (+), c = 001 (*), c = 010 (diamonds) and c = 111 (triangles),
while for the $l = 8$ cases they are c = 0000000 (+), c = 0000001 (*), c = 0000010
(diamonds), c = 0000011 (triangles), and we have shown the roots associated with the
remaining classes as small dots. The dashed circles correspond to $|z| = 1$ and $|z| = r$
and have been inserted as a guide to the eye. All classes have a root near $z = (1, 0)$. The
remaining roots cluster around and beyond the circle $|z| = r$.
The integral over the boundary at infinity turns out to yield no contribution, since for
the cases of interest $f(z) \to 0$ as $|z| \to \infty$ and the integrand therefore falls off
faster than $1/|z|$. This leaves the contributions from the zeroes of
$(1 - z)\lambda(z; \mathbf{c})$, which are traversed counter-clockwise if the contour at
infinity is traversed clockwise.

Considering the matching probability, we find

$$ p(m, \mathbf{c}) = -\sum_{i=0}^{M(\lambda)} \frac{1}{2\pi i} \oint_{\partial D_i} dz\, \frac{1}{z^{m+1}}\, \frac{1}{(1 - z)\, \lambda(z; \mathbf{c})}, \tag{2.49} $$

where $\partial D_i$ is a counter-clockwise contour around the $i$-th zero of
$(1 - z)\lambda(z; \mathbf{c})$, $M(\lambda)$ is the number of zeroes of $\lambda$, and we assume
that the zeroes are ordered such that $z_0 = 1 < z_1 < |z_2| \le \ldots \le |z_M|$. Evaluating
explicitly the residues for the first two poles we have

$$ p(m, \mathbf{c}) = 1 - A_1 \left( \frac{1}{z_1} \right)^{m+1} - \sum_{i=2}^{M(\lambda)} \frac{1}{2\pi i} \oint_{\partial D_i} dz\, \frac{1}{z^{m+1}}\, \frac{1}{(1 - z)\, \lambda(z; \mathbf{c})}, \tag{2.50} $$

with the residue $A_1$ given by

$$ A_1 = \frac{1}{\lambda'(z_1; \mathbf{c})\, (1 - z_1)}. \tag{2.51} $$

The remaining zeroes $z_2, z_3, \ldots, z_l$ of $\lambda(z; \mathbf{c})$ are located near and
beyond $|z| \sim r$, so that in the limit of large $m$ their relative contributions are smaller.
We thus arrive at the asymptotic form

$$ p(m, \mathbf{c}) \simeq 1 - A_1 \left( \frac{1}{z_1} \right)^{m+1} \tag{2.52} $$

for large $m$.
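The quality of the asymptotic form (2.52) is easy to probe numerically. The sketch below is an illustration under stated assumptions: the smallest zero $z_1$ of $\lambda(z; \mathbf{c})$ is located by Newton iteration started at $z = 1$ (a choice made here, not taken from the text), $A_1$ is evaluated from Eq. (2.51), and the exact matching probability is obtained from the recursions (2.14) and (2.21). All function names are hypothetical.

```python
from fractions import Fraction


def bit_vector(x):
    l = len(x)
    return [int(x[:l - b] == x[b:]) for b in range(1, l)]


def exact_matching_probability(x, r, m):
    """p(m; x) from Eq. (2.12), with d(b) built from the recursions (2.14), (2.21)."""
    l, c = len(x), bit_vector(x)
    C, h, d = [0] * (m + 1), [0] * (m + 1), [1] + [0] * m
    for b in range(1, m + 1):
        C[b] = c[b - 1] if b < l else r ** (b - l)
        h[b] = C[b] - sum(h[a] * C[b - a] for a in range(1, b))
        d[b] = r * d[b - 1] - h[b]
    return sum(Fraction(d[b], r ** (b + l)) for b in range(m + 1))


def lam(z, c, r):
    """lambda(z; c) of Eq. (2.36); the word length is l = len(c) + 1."""
    l = len(c) + 1
    cz = sum(cb * (z / r) ** b for b, cb in enumerate(c, start=1))
    return z ** l + r ** l * (1 - z) * (1 + cz)


def dlam(z, c, r):
    """Derivative of lambda(z; c) with respect to z."""
    l = len(c) + 1
    cz = sum(cb * (z / r) ** b for b, cb in enumerate(c, start=1))
    dcz = sum(b * cb * z ** (b - 1) / r ** b for b, cb in enumerate(c, start=1))
    return l * z ** (l - 1) + r ** l * (-(1 + cz) + (1 - z) * dcz)


def smallest_root(c, r, steps=60):
    """Newton iteration for the real zero z_1 slightly above 1."""
    z = 1.0
    for _ in range(steps):
        z -= lam(z, c, r) / dlam(z, c, r)
    return z


def asymptotic_matching_probability(x, r, m):
    """Eq. (2.52), with A_1 taken from Eq. (2.51)."""
    c = bit_vector(x)
    z1 = smallest_root(c, r)
    A1 = 1.0 / (dlam(z1, c, r) * (1.0 - z1))
    return 1.0 - A1 * z1 ** (-(m + 1))


if __name__ == "__main__":
    word = (0, 0, 0, 1)  # bit-vector 000
    for m in (0, 2, 5, 10, 40):
        print(m, float(exact_matching_probability(word, 2, m)),
              asymptotic_matching_probability(word, 2, m))
```

For small $m$ the two columns differ visibly, while for large $m$ the contributions of the remaining zeroes die out geometrically.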

We can obtain approximate expressions for $z_1$, and thus an approximation of the
asymptotic behavior, as follows. With

$$ \lambda(z; \mathbf{c}) = z^l + r^l (1 - z) \left[ 1 + c(z/r) \right] \tag{2.53} $$

we see that when $z \sim 1$, the second term in the above equation is a large term, $r^l$,
multiplied with a term that will be small due to the $z - 1$ prefactor. The product of these
two terms can be made of order 1 if $z - 1 \sim 1/r^l$, which then can be made to cancel the
first term $z^l$ if $z > 1$. Using the Lagrange inversion formula, $z - 1$ can be expanded in
a power series in $1/r^l$. Letting $u = z - 1$ and $t = r^{-l} \left[ 1 + c(1/r) \right]^{-1}$,
the equation $\lambda(z; \mathbf{c}) = 0$ can be written in the form

$$ u = t\, \phi(u), \tag{2.54} $$

where

$$ \phi(u) = (1 + u)^l\, \frac{1 + c(1/r)}{1 + c\big((1 + u)/r\big)} \tag{2.55} $$

is a formal power series in $u$. Thus

$$ z_1 = 1 + u(t) = 1 + \sum_{i=1}^{\infty} u_i t^i, \tag{2.56} $$

with

$$ u_i = \frac{1}{i!} \left. \frac{d^{i-1}}{du^{i-1}}\, \phi(u)^i \right|_{u=0}. \tag{2.57} $$

One finds to leading and sub-leading order

$$ u_1 = 1, \tag{2.58} $$

and

$$ u_2 = l - \frac{1}{r}\, \frac{c'(1/r)}{1 + c(1/r)}, \tag{2.59} $$

with

$$ c'(1/r) = \sum_{b=1}^{l-1} b\, c_b(\mathbf{x}) \left( \frac{1}{r} \right)^{b-1}, \tag{2.60} $$

so that to leading order we have

$$ z_1 = 1 + \frac{1}{r^l}\, \frac{1}{1 + c(1/r)}. \tag{2.61} $$

The residue $A_1$ can be evaluated similarly, and we find to order $1/r^l$ that

$$ A_1 = 1 - \frac{1}{r^l}\, \frac{c'(1/r)/r}{\left[ 1 + c(1/r) \right]^2}. \tag{2.62} $$

Note that $A_1$ is of order one.
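The expansion in the small parameter $1/r^l$ can be compared against a numerical root. The sketch below is illustrative and makes explicit assumptions: $z_1$ is found by Newton iteration started at $z = 1$, $c'(1/r)$ is taken to be the ordinary derivative of the polynomial $c(z)$ evaluated at $z = 1/r$, and the first-order expression for $A_1$ uses that same convention. Function names are hypothetical.

```python
def lam(z, c, r):
    """lambda(z; c) = z^l + r^l (1 - z) [1 + c(z/r)], with l = len(c) + 1."""
    l = len(c) + 1
    cz = sum(cb * (z / r) ** b for b, cb in enumerate(c, start=1))
    return z ** l + r ** l * (1 - z) * (1 + cz)


def dlam(z, c, r):
    """Derivative of lambda(z; c) with respect to z."""
    l = len(c) + 1
    cz = sum(cb * (z / r) ** b for b, cb in enumerate(c, start=1))
    dcz = sum(b * cb * z ** (b - 1) / r ** b for b, cb in enumerate(c, start=1))
    return l * z ** (l - 1) + r ** l * (-(1 + cz) + (1 - z) * dcz)


def numerical_root_and_residue(c, r, steps=60):
    """z_1 by Newton iteration and A_1 = 1 / (lambda'(z_1) (1 - z_1))."""
    z = 1.0
    for _ in range(steps):
        z -= lam(z, c, r) / dlam(z, c, r)
    return z, 1.0 / (dlam(z, c, r) * (1.0 - z))


def series_root_and_residue(c, r):
    """z_1 to second order in t, and A_1 to first order in 1/r^l."""
    l = len(c) + 1
    g0 = 1 + sum(cb / r ** b for b, cb in enumerate(c, start=1))    # 1 + c(1/r)
    g1 = sum(b * cb / r ** b for b, cb in enumerate(c, start=1))    # c'(1/r) / r
    t = 1.0 / (r ** l * g0)
    u2 = l - g1 / g0
    return 1 + t + u2 * t ** 2, 1 - g1 / (r ** l * g0 ** 2)


if __name__ == "__main__":
    c, r = [0, 1, 0, 1, 0], 4          # bit-vector of the word 010101
    print(numerical_root_and_residue(c, r))
    print(series_root_and_residue(c, r))
```

For short words over a binary alphabet the parameter $1/r^l$ is not particularly small, so the truncated series is noticeably less accurate there than in the example above.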

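The asymptotic behavior of $h(b; \mathbf{c})$ and $d(b; \mathbf{c})$ can be worked out in a
similar manner. For large $b$ we find

$$ h(b) \simeq h_{asy}(b) = \frac{A_1}{r^l z_1} \left[ r^l (z_1 - 1) \right]^2 \left( \frac{r}{z_1} \right)^b \tag{2.63} $$

and

$$ d(b) \simeq d_{asy}(b) = \frac{A_1}{z_1} \left[ r^l (z_1 - 1) \right] \left( \frac{r}{z_1} \right)^b, \tag{2.64} $$

where $A_1$ is given by Eq. (2.51), $z_1$ is the smallest root of $\lambda(z; \mathbf{c})$, and
from the expansion of $z_1$, Eq. (2.56), we see that the terms in square brackets are of
order one.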
Figure 2: Comparison of the asymptotic form, Eq. (2.52), with the exact matching
probabilities $p(m; \mathbf{c})$. The figure shows the matching probability for $l = 4$,
$r = 2$, and $6 < k = m + l < 88$. The open circles correspond to the numerically obtained
matching probabilities for c = 000, 001, 010 and 111 (from top to bottom). The lines correspond
to the asymptotic form, Eq. (2.52), with $A_1$ and $z_1$ calculated numerically. Inset: Plot
of $p(m; \mathbf{c})$ for intermediate values of $m$. The symbols are as in the main figure. The
equivalence classes are (from top to bottom): 000, 001, 010 and 111.



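Taking the asymptotic forms of $h$ and $d$ to calculate the $n$-match distribution one finds

$$ p^{(0)}(n; m, \mathbf{c}) = A_1 \binom{m+1}{n} \left[ A_1 r^l \left( 1 - \frac{1}{z_1} \right)^2 \right]^n \left( \frac{1}{z_1} \right)^{m+1-n}. \tag{2.65} $$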



Figure 2 shows a comparison of the asymptotic form, Eq. (2.52), with the exact matching
probabilities $p(m; \mathbf{c})$. The figure shows the matching probability for $l = 4$, $r = 2$,
and $6 < k = m + l < 104$. For $l = 4$ there are 4 equivalence classes: 000, 001, 010, and 111
with 6, 6, 2 and 2 members, respectively (cf. Table 1 for the case $r = 2$). The
open circles correspond to the numerically obtained matching probabilities for c = 000,
001, 010 and 111 (from top to bottom). For $m < 20$ ($k < 24$), $p(m; \mathbf{c})$ was obtained
by direct enumeration of all possible strings and checking for matches. For $m > 20$ a
sampling algorithm was used: for each value of k, 10^ strings of length k were generated 
randomly and the matching probability was obtained by counting the matching strings 
of the sample. The solid lines correspond to the asymptotic form, Eq. (2.52), with $z_1$
and $A_1$ calculated numerically from Eqs. (2.36) and (2.51) for each of the equivalence
classes. The inset shows $p(m; \mathbf{c})$ for intermediate values of $m$, where we do not
expect the asymptotic form to be very good.

The discrepancies become much more severe when we consider the $n$-match distribution.
Fig. 3 shows the $n$-match distributions for a 4-letter binary string inside a random
string of length $k = 256$ for the four possible equivalence classes c = 000 (top left), c = 001
(top right), c = 010 (bottom left), and c = 111 (bottom right). The solid circles are the
exact matching probabilities that were obtained numerically using the algorithm described
above. The dotted line corresponds to the approximation Eq. (2.65), normalized by an
overall constant. The dashed line corresponds to the Gaussian approximation of Kleffe
and Borodovsky [29], while the dot-dashed line is the compound Poisson approximation
of Chrysaphinou and Papastavridis, Geske et al., and Schbath.

Note that while the approximation Eq. (2.65) performs very poorly, the Gaussian
and compound Poisson distributions approximate the true distribution well only for some
equivalence classes $\mathbf{c}$, but fail for others, as was noted by Robin and Schbath. The
solid line, on the other hand, is the single analytical result of this article and agrees well
with the actual distributions. We now turn to the description of the $n$-match probability
in terms of the (configurational) partition function of a 1d lattice gas.

3 The n-match probability as the partition function
of a 1d lattice gas

In this section we present the statistical mechanics approach to calculating the $n$-match
distribution function. We first map the problem into one of calculating the
(configurational) partition sum of a 1d lattice gas. We next analyze the interactions
emerging in such a description, then set up a virial expansion leading to an approximate
evaluation of the partition function, and finally discuss asymptotic limits.

3.1 The n-particle partition function 

Our starting point is Eq. (2.32), which we reproduce below for convenience,

$$ p(n; m, \mathbf{c}) = \frac{1}{r^{m+l}} \sum_{a_1 < a_2 < \cdots < a_n} d(a_1) \left[ \prod_{i=1}^{n-1} h(a_{i+1} - a_i) \right] d(m - a_n), \tag{3.1} $$

with $d$ and $h$ as defined in Eqs. (2.11) and (2.15). The expression above for
$p(n; m, \mathbf{c})$ already resembles the partition function of a gas of $n$ particles with
particle-boundary interactions proportional to $-\ln d$ and nearest-neighbor particle-particle
interactions proportional to $-\ln h$. In order to make this analogy work, we need to consider
what we mean by the free-particle, i.e. no-interaction, limit.
Figure 3: The $n$-match distribution for matching an $l = 4$ letter binary string $\mathbf{x}$
inside a random string of length $k = 256$, for x = 0001 (top left), x = 1001 (top right),
x = 1010 (bottom left) and x = 1111 (bottom right). The circles are the exact probabilities, the
dotted line corresponds to the approximation Eq. (2.65) (normalized by an overall constant),
and the dashed and dashed-dotted lines correspond to the Gaussian and compound
Poisson approximations (see text for details). The solid line is the analytical result of this
paper.



Note that d(b) and h(b) are conditional matching weights. For example, h(b) is the
weight of the compound event: given a match at position a, what is the likelihood that the
next match is at a + b. The asymptotic behavior of d(b) and h(b), Eqs. (2.64) and (2.63),
can be interpreted to correspond to the approximation when the correlations inherent in
the compound events are ignored. Thus the ratios d(b)/d_asy(b) and h(b)/h_asy(b) measure
the strength of the correlations in such events.

It is therefore natural to define the particle-boundary and particle-particle interactions,
U^boun(b) and U(b), respectively as

    \beta U^{\rm boun}(b) = -\ln \frac{d(b)}{d_{\rm asy}(b)},    (3.2)

    \beta U(b) = -\ln \frac{h(b)}{h_{\rm asy}(b)}.    (3.3)

We thereby obtain meaningful physical interactions that vanish as b → ∞. Note that since
the potentials do not have any characteristic scale, a temperature by itself is meaningless
and we will always write "energies" with the prefactor β, i.e. in dimensionless units.

The (configurational) partition function, Eq. (2.32), can now be written in terms of
these interactions as

m, c) = ^i^e^'^" ^ g-/3W„(ai,...,a„)^ (3 4) 



with 



and the Hamiltonian given by 



eP^' = Ai- {zi - 1)' (3.5) 



    H_n(a_1, \ldots, a_n) = U^{\rm boun}(a_1) + U^{\rm boun}(m - a_n) + \sum_{i=1}^{n-1} U(a_{i+1} - a_i).    (3.6)

Eq. (2.65) corresponds to the free-particle limit (U = U^boun = 0), which in the probability
language is the limit of all correlations suppressed. Before proceeding, it is instructive to
study these interactions in more detail.
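To make the structure of Eq. (3.6) concrete, the following minimal Python sketch evaluates βH_n for a given set of match positions. The function names and the toy potentials (a pure hard core of size l = 4 with no tail and no boundary interaction) are illustrative assumptions and are not the actual potentials of this paper.

```python
import math


def hamiltonian(positions, m, u_pair, u_boun):
    """Return beta*H_n of Eq. (3.6) for ordered positions a_1 < ... < a_n."""
    a = list(positions)
    # Two particle-boundary terms: distance to the left and to the right end.
    energy = u_boun(a[0]) + u_boun(m - a[-1])
    # Nearest-neighbor particle-particle terms.
    for left, right in zip(a, a[1:]):
        energy += u_pair(right - left)
    return energy


def boltzmann_weight(positions, m, u_pair, u_boun):
    """exp(-beta*H_n); an infinite energy gives weight zero."""
    return math.exp(-hamiltonian(positions, m, u_pair, u_boun))


# Toy potentials (assumptions for illustration only).
def toy_pair(b, core=4):
    return math.inf if b < core else 0.0


def toy_boun(b):
    return 0.0
```

With these toy potentials, a configuration such as (2, 6, 11) on m = 20 sites has zero energy and weight one, while (2, 4) violates the hard core and has weight zero.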

3.2 Interactions 

Consider the particle-particle interaction first. From Eqs. (2.63) and (3.3) we find that



Zl 



h(hy-^z\, (3.7) 



where the term in square brackets is of order one, with respect to the small parameter 
l/r\ cf. Eqs. ^(H^ and ^M)- 

Figure 4 shows the particle-particle interactions for words of length l = 4 and l = 6
as parameterized by their associated equivalence classes c. The potentials are plotted
against distance measured in units of the word length l and have been vertically offset for
clarity, with the dashed lines representing U = 0. The crosses on the dashed line indicate
that the associated potential at that value is +∞. We see that the potentials have infinite
values only for b < l. Also, the values of the potential in the regime b < l are generally
much bigger than in the regime b > l, meaning that the potential is stronger in the former
region. We will refer to the regions b < l and b > l as the core and tail of the interaction,
respectively. For a given length l and depending on c, we also see that the interactions
have different features. For c = 0⋯0 the interaction has a hard core of size l followed by
a repulsive tail, while for c = 1⋯1 the interaction has a strongly attractive component
at b = 1, followed by a hard-core region for 1 < b < l, that goes over into an oscillatory
but decaying tail. The potentials for the other values of c seem to be a mixture of these
two types of behavior.

Figure 5 shows the behavior of the potentials associated with the equivalence classes
c = 0⋯0 (left) and c = 1⋯1 (right) in their dependence on the word length l. For both
equivalence classes we see that the tail of the interaction becomes weaker as l increases.
When c = 0⋯0, the core is a hard core and only the core size l changes. The situation is
different for c = 1⋯1. For the c = 1⋯1 family of interactions we see that the attractive
part of the core actually becomes stronger with increasing l. It turns out that the same
is also true for the other equivalence classes: with increasing l, the cores of the
interactions become stronger, while the tails become weaker.



In summary, Figs. 4 and 5 suggest the following generic features of the interactions:
(i) a strong core for b < l, followed by a weak tail for b > l, and (ii) for a given family of
interactions, as l increases the core of the interaction tends to become stronger, while the
tail of the interaction becomes weaker.



These observations can be readily proven from the small-b behavior of h(b), which in
turn can be extracted from the recursion for h(b), Eqs. (2.21) and (2.24). Thus we find
for b < l

    h(b) = \begin{cases} c_b, & \text{if } \chi \text{ does not divide } b, \\ 1, & \text{if } b = \chi, \\ 0, & \text{otherwise}, \end{cases}    (3.8)

where χ is the fundamental period associated with c that was defined at the end of Section
2.3. Recall that by definition, h(0) = 0.
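The quantities c_b and χ entering Eq. (3.8) can be computed directly from a word. The sketch below assumes the usual autocorrelation convention, namely that c_b = 1 when the word overlaps with itself under a shift by b letters, and that χ is the smallest such shift (or l if there is none). The definitions in Section 2 take precedence if they differ; the function names are hypothetical.

```python
def autocorrelation(word):
    """Bits c_1 ... c_{l-1}: c_b = 1 if the word matches itself shifted by b."""
    l = len(word)
    return [1 if word[b:] == word[: l - b] else 0 for b in range(1, l)]


def fundamental_period(word):
    """Smallest shift b with c_b = 1, or the word length l if no overlap exists."""
    for b, bit in enumerate(autocorrelation(word), start=1):
        if bit:
            return b
    return len(word)


# The four binary words used in Fig. 3 and their classes c:
print(autocorrelation("0001"), fundamental_period("0001"))  # [0, 0, 0] 4
print(autocorrelation("1001"), fundamental_period("1001"))  # [0, 0, 1] 3
print(autocorrelation("1010"), fundamental_period("1010"))  # [0, 1, 0] 2
print(autocorrelation("1111"), fundamental_period("1111"))  # [1, 1, 1] 1
```

Under this convention the four words of Fig. 3 reproduce the classes c = 000, 001, 010 and 111 quoted in the text.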

Thus the interaction in the core region can be written as

    \beta U(b) = -\ln h(b) + b \ln\left(\frac{r}{z_1}\right) - l \ln r + \beta U_0, \qquad b < l,    (3.9)



where 



Zl 



is a constant that is of order 1/r^l, since the argument of the logarithm equals 1 to the
same order.

Figure 4: Plot of the effective potentials βU(b), Eq. (3.7), associated with the equivalence
classes c of strings of length l. The potentials are plotted against distance measured in
units of the word length l. Note that the potentials have been vertically offset for clarity.
The dashed lines represent the U = 0 lines for each potential. The crosses on the line U = 0
indicate that the associated potential at that point is +∞. Left: Interparticle potentials
associated with words of length l = 4, for which the possible equivalence classes are
c = 000, 010 and 111, as indicated in the figure. Right: same as left but for l = 6. Notice
how the attractive part of the interaction emerges and grows stronger as the fundamental
period of the string decreases to 1 (c = 1⋯1). The tail of the interaction corresponds
to the regime b/l > 1.



We see that the interaction becomes +∞ whenever h(b) = 0. This is certainly the
case for b < χ. Furthermore, since r/z_1 > 1, in the core region finite values of U(b)
increase with increasing b, as clearly seen in Fig. 4.

The first finite value of U(b) occurs at b = χ. From Eq. (3.9) we obtain for U(χ)

    \beta U(\chi) = \chi \ln\left(\frac{r}{z_1}\right) - l \ln r + O\left(\frac{1}{r^l}\right).    (3.11)

Thus it is apparent that for fixed χ, βU(χ) becomes more negative as either l or r
increases. In fact we see that to leading order, the dependence of βU(χ) on l is linear,
while its dependence on r is logarithmic. Also for χ < l we see that βU(χ) is negative.

Figure 5: Plot of the effective potentials βU(b), Eq. (3.7), associated with the equivalence
classes c = 0⋯0 and 1⋯1 and their dependence on the length l. The potentials are
plotted against distance measured in units of the word length l. Note that the potentials
have been vertically offset for clarity. The dashed lines represent the U = 0 line for each
potential. The crosses on the line indicate that the associated potential at that value
is +∞. Left: Interparticle potentials associated with the equivalence class c = 0⋯0
for words of length l = 3, 4, 6 and 8. Note that the interactions have a hard core of size
b/l = 1 followed by a repulsive tail. The strength of the tail weakens with increasing
l. Right: Interparticle potentials associated with the equivalence class c = 1⋯1 for
words of length l = 3, 4, 6 and 8. Note that the interactions have an attractive part at
b = 1, followed by a hard core for b/l < 1, and a weak, oscillatory decaying tail. Also
note the opposite behavior of the strength of the core and the tail: with increasing l, the
strength of the attractive part of the core is seen to increase, while the strength of the
tail decreases.

The case when χ = l, corresponding to c = 00…0, is a little more complicated. In
that case the ln r terms in Eq. (3.11) cancel and we are left with a term l ln z_1 which is of
order 1/r^l as well, and thus βU_0 cannot be neglected anymore. However, this means that
the potential is of order 1/r^l, which turns out to be the correct scale of the strength of
the tail and indeed decreases as r or l increase (see Fig. 5). To be specific, for χ = l, it



can be readily checked that h(b) = r^{b-l} for l < b < 2l and the potential in this regime
thus becomes

0U(b)^?l±^ + o(l.). (3.12) 



where we have substituted the expansions of A_1 and z_1, Eqs. (2.62) and (2.56), respectively,
to lowest non-trivial order. Thus we see that for χ = l the characteristic energy scale of
the tail of the interaction scales like l/r^l, and decreases as l or r increase.

The case of general χ and c is similar; the calculations are tedious yet straightforward.
Rather than doing this, we will motivate the result by considering the value of the
interaction at b = l + χ, which is readily worked out from

- 1, for X < V2, 

HI + X) = < r^- h{l) - 1, - Et=v+i h{r)c{l + X - 1), for //2 < x < /, (3.13) 




This means that 



for X = ^- 



h{l + x)=r^{l-e), (3.14) 



where εr^χ is at most l − χ + 1 and hence of order l. Substituting this result into the
expression for U(b) along with the expansions for z_1 and A_1, one finds that the result is
of the form

-mi + x) = ^ + oi^^y (3.15) 

where α is a c- and l-dependent constant of order one.

To conclude, we find that the characteristic energy of the core of the interaction scales
like −(l − χ) ln r (χ < l), while the energy of the tail goes to leading order like 1/r^l. These
results are consistent with the behavior observed in Figs. 4 and 5.

Turning to the particle-boundary interactions, note that Eq. (2.14), which can be
conveniently written as

^ = 1 - V ^ (3 16) 

a=l 

relates the properties of h to those of d. We thus see that analogous results can be
obtained for the boundary interaction U^boun(b) and we leave the details to the interested
reader.



3.3 The Hamiltonian 

The results of the previous section allow us to obtain approximate expressions for the
probability p(n; m, c), by first approximating the effective Hamiltonian H_n and then
carrying out the configurational sums. This is most easily done using generating functions.
Define the generating functions associated with Eqs. (3.2) and (3.3) as

    D(z) = \sum_{b=0}^{\infty} z^b e^{-\beta U^{\rm boun}(b)},    (3.17)

    H(z) = \sum_{b=0}^{\infty} z^b e^{-\beta U(b)}.    (3.18)

It is not difficult to show that in terms of the generating functions of d(b) and h(b), D(z)
and H(z) are given by

D{z) = e-^^(2i - l)rf c) (3.19) 

H{z) = e-f^h(^^;cy (3.20) 
Using the convolution property, Eq. (3.4) can be written in terms of the generating
functions D(z) and H(z) as

p{n; m, c) = ^e^-^ £ dz^ D\z)H-'\z), (3.21) 

where the contour is again the boundary of a domain enclosing the origin inside of which
D^2(z) H^{n-1}(z) is analytic. Eq. (3.21) is the lattice analog of the partition function of a 1d
gas with pairwise nearest-neighbor interactions. The 1d continuum case has been treated
in detail by Gürsey (see also Fisher).
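The convolution property can be illustrated numerically: the configurational sum over match positions equals the coefficient of z^m in D^2(z) H^{n-1}(z). The sketch below checks this against a brute-force enumeration. It omits the prefactors of Eq. (3.21), and the Boltzmann weights used (a hard core of three sites, no tail, no boundary interaction) are toy assumptions rather than the weights of this paper; the function names are hypothetical.

```python
import itertools

import numpy as np


def partition_sum_convolution(m, n, w_boun, w_pair):
    """Coefficient of z^m in D(z)^2 * H(z)^(n-1), series truncated at degree m."""
    D = np.array([w_boun(b) for b in range(m + 1)], dtype=float)
    H = np.array([w_pair(b) for b in range(m + 1)], dtype=float)
    poly = np.convolve(D, D)[: m + 1]
    for _ in range(n - 1):
        poly = np.convolve(poly, H)[: m + 1]
    return poly[m]


def partition_sum_brute_force(m, n, w_boun, w_pair):
    """Direct sum over ordered positions 0 <= a_1 <= ... <= a_n <= m."""
    total = 0.0
    for a in itertools.combinations_with_replacement(range(m + 1), n):
        weight = w_boun(a[0]) * w_boun(m - a[-1])
        for left, right in zip(a, a[1:]):
            weight *= w_pair(right - left)
        total += weight
    return total


# Toy Boltzmann factors (assumptions): hard core of size 3, free boundaries.
def toy_w_pair(b):
    return 1.0 if b >= 3 else 0.0


def toy_w_boun(b):
    return 1.0
```

For m = 12 and n = 3 both routines count the 84 ways of placing three non-overlapping particles, which is the stars-and-bars value C(9, 3).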

Next, define the truncated generating functions D_Λ(z) and H_Λ(z) as

    D_\Lambda(z) = \sum_{b=0}^{\Lambda-1} z^b e^{-\beta U^{\rm boun}(b)},    (3.22)

    H_\Lambda(z) = \sum_{b=0}^{\Lambda-1} z^b e^{-\beta U(b)}.    (3.23)

It is readily seen that these generating functions are associated with the Boltzmann factor
of an interaction that has been cut off at b ≥ Λ. The idea is that since, by construction, the
interactions decay to zero at large distances, introducing a finite cut-off Λ will introduce
only a small and controllable error in the overall calculation. In what follows, we will use
this to set up a perturbation expansion of the probability distribution. We need to note,
however, that since the result has to be a normalized distribution, setting the potential to
zero beyond the cut-off will destroy the normalization of the distribution. Indeed there
are at least two ways to handle the interaction beyond the cut-off: (i) we can either set the
interaction to a constant U_Λ for b ≥ Λ and eventually choose U_Λ such that the distribution
is normalized, or (ii) we take the interaction beyond Λ to be rapidly decaying. It turns
out that the calculation can be done for either of the cases. The approximation by a
constant potential beyond the cut-off lends itself readily to obtaining error bounds, as we
will sketch below. On the other hand, it turns out that the tail of the actual interactions
does asymptotically decay exponentially. Thus letting the interaction decay exponentially
beyond the cut-off turns out to be a very good approximation and we will calculate the
probability distributions in this way.

Consider the case of a constant potential beyond the cut-off first and define the
approximate interaction U_Λ(b) as

    U_\Lambda(b) = \begin{cases} U(b), & b < \Lambda, \\ U_\Lambda, & b \ge \Lambda, \end{cases}    (3.24)

with the corresponding generating function given by

    \tilde{H}_\Lambda(z) = \sum_{b=0}^{\infty} z^b e^{-\beta U_\Lambda(b)} = H_\Lambda(z) + e^{-\beta U_\Lambda} \frac{z^\Lambda}{1 - z}.    (3.25)

Since d(z) is related to h(z) via Eq. (2.39), this implies a corresponding boundary
interaction which can be worked out as



Da(z) = J2 A-^^-(^) = Da(z) + e-^^- (3.26) 

6=0 ^ 

Define the approximation to p(n; m, c), Eq. (3.21), as

p(n; t/A, m, c) = f dz—^ Dl{z)Hl-\z), (3.27) 

It is clear that as Λ → ∞ we must have U_Λ → 0, since an increasingly larger part of
the true interactions is kept.

By using the definition of U_Λ(b) and writing p(n; U_Λ, m, c) in the partition sum form
of Eq. (2.32), it can readily be verified that if

    U_- < U_\Lambda < U_+,    (3.28)

this implies that

    U_-(b) < U_\Lambda(b) < U_+(b)    (3.29)

for all values of b, which in turn implies that

    p(n; U_+, m, c) < p(n; U_\Lambda, m, c) < p(n; U_-, m, c).    (3.30)

Thus by choosing U_+ and U_- as

    U_+ = \max\{U(b), U^{\rm boun}(b)\},    (3.31)

    U_- = \min\{U(b), U^{\rm boun}(b)\},    (3.32)

one could in principle obtain error bounds on the approximate distribution, which will
become tighter as Λ → ∞. We will not pursue this any further in the present article, but
instead perform the calculation with an exponentially decaying interaction beyond the
cut-off Λ.

Recall that the tail of the true interaction is due to the other zeroes of λ(z; c), which
are located a distance ~ r from the origin (see Fig. 1). Thus superposed on the asymptotic
behavior of /i(6), which we have shown to fall-off like ^f^, there will be terms that decay 
more rapidly and roughly as r~^, since zi < r. In fact it is the latter that are responsible for 
the asymptotic behavior of the interactions. For b large, we therefore take approximately 

h{b) ^ e'^^'z^^ + 7e^^r-'' (3.33) 
which upon taking logarithms and factoring out the first terms implies that asymptotically 

pU{b) ^ pfiblnzi-j (^^y , (3.34) 
where we have neglected higher-order terms ∼ γ^2 (z_1/r)^{2b}. Of course, with increasing
cut-off Λ, the residual tail will be less important.

This suggests taking the following approximate interactions:

with the corresponding approximate generating function given by 

Mz) = E = ^a(-) + ^ (7) + 1^- (3-36) 

Since d(z) is related to h(z) via Eq. (2.39), this implies a corresponding approximate
interaction for the boundary, which can be worked out:

Mz) = E = ^a(-) + 7 (7) ^ + (3.37) 

b=0 

Denoting the generating function of the approximate tail of the interaction as 

r 

p{n; m, c) becomes 

P(^;7,m,c) = -^^J^J^^ Dl{z)Hl-\z). (3.39) 

What therefore remains to be done is to evaluate the contour integral, Eq. (3.39),
which can be carried out by the method of stationary phase, which in the context of
generating functions is also known as Hayman's method |3Uj : 

3.4 Distributions 

Write the integral in Eq. (3.39) as

    I = \frac{1}{2\pi i} \oint \frac{du}{u^{m+1}} f(u).    (3.40)

Then for large m, the value of the integral is given approximately by

    I \approx \left(\frac{1}{u_m}\right)^{m} \frac{f(u_m)}{\sqrt{2\pi b_m}},    (3.41)

where u_m is the smallest positive real root of the equation

    m = u \frac{d}{du} \ln f(u)    (3.42)

and b_m is given by

    b_m = u \frac{d}{du} \ln f(u) + u^2 \frac{d^2}{du^2} \ln f(u),

evaluated at u = u_m.
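As a sanity check of the saddle-point recipe of Eqs. (3.41) and (3.42), the following minimal Python sketch applies it to the textbook case f(u) = e^u, whose exact coefficient of u^m is 1/m!. The function name, the bisection bracket and the assumption that u d/du ln f(u) is increasing on the bracket are illustrative choices, not part of the paper.

```python
import math


def hayman_coefficient(log_f, dlog_f, d2log_f, m, lo=1e-9, hi=1e3):
    """Approximate the coefficient of u^m in f(u) via Eqs. (3.41)-(3.42).

    log_f, dlog_f, d2log_f: ln f and its first two derivatives in u.
    Assumes u * dlog_f(u) is increasing on [lo, hi] and crosses m there.
    """
    # Solve the saddle-point condition m = u d/du ln f(u) by bisection.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * dlog_f(mid) < m:
            lo = mid
        else:
            hi = mid
    u_m = 0.5 * (lo + hi)
    b_m = u_m * dlog_f(u_m) + u_m**2 * d2log_f(u_m)
    # Work with logarithms to avoid overflow for large m.
    log_value = log_f(u_m) - m * math.log(u_m) - 0.5 * math.log(2.0 * math.pi * b_m)
    return math.exp(log_value)


# Example: f(u) = exp(u), exact coefficient 1/m!.
m = 20
approx = hayman_coefficient(lambda u: u, lambda u: 1.0, lambda u: 0.0, m)
exact = 1.0 / math.factorial(m)
print(approx / exact)  # close to 1; the leading correction is of relative order 1/m
```

For this example the saddle point is u_m = m and b_m = m, so the formula reduces to Stirling's approximation, which is consistent with the later remark that higher orders of the stationary phase approximation form a 1/n-type expansion.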



Applying Hayman's method to the integral, Eq. (3.39), we let

and find after a little bit of algebra 



(3.43) 



(3.44) 



m 



u—lnf{u) 
du 



X 



1 + Ax + x^{l + x)^ 


-2 




l + x{l + x)^-^ 







+ 



where we have parameterized u as 



^_ll + Ax + x2(l + x)^"2 






+x) 


^ l + x(l + x)^-i 


Ha 


(1^^) + 




1 



(3.45) 



    u = \frac{1}{1 + x}.    (3.46)



Since we are interested in solutions for large m, it is clear from the above that to
leading order x ∝ 1/m. Multiplying both sides of the above equation by x and expanding
the fractions in a power series around x = 0, we obtain



    m x = (n + 1) \left\{ 1 + e_1 x + e_2 x^2 + \ldots \right\}

The first two orders can be readily worked out, yielding

2^A(l) + (n-l)i^A(l) 



ei = A - 7e 



n + 1 



(3.47) 



(3.48) 



and 



2Di(l) + (n- 1)^1(1) + 2 



n + l 

[2(A - 7O - 1] f2DA(l) - (n - l)i^A(l) 



2b'Al) + {n-l)H'{l] 



where 



1 



(3.49) 
(3.50) 



and 7 = 7(1 - 1/0^- 

Rewriting Eq. (3.47) in a form suitable for Lagrange's inversion formula,



X 



n + l 
m — in + 



(3.51) 
We obtain an expansion of x in terms of (n + 1)/[m − (n − 1)e_1] and the coefficients as

es... . (3.52) 



n + 1 
m — [n — Ijei 



I 1 3 
72+1 



m — [n — l)ei 

The term bm can be worked out in a similar manner and we find 



n + 1 

m + - (n + l)(ei + 62) + . . . , (3.53) 



where the omitted terms are of order x and higher. 

Combining Eqs. (ICTD . (ICTll . with Eqs. (ITH^ and (Hn^ we finally obtain 

p(„;0.™.c) « ^(1 + .)".Z)i (^) HV (t^) (3.54) 

The strength of the tail, γ, is still undetermined and we will determine it by fitting
the approximate tail to the actual interaction in the interval b ∈ [Λ, Λ + l − 1]. Note that
this way there are no adjustable parameters and, since the tail is only approximate, the
normalization is not perfect and is found to vary by a few percent. Alternatively, one
can choose γ such that normalization is achieved. In either of the cases the distributions
do not vary significantly, meaning that for a certain range of γ values, the shape of the
distribution is robust.

The solid lines in Fig. 3 show the approximate distribution, Eq. (3.54), for the four
equivalence classes associated with words of length l = 4 and with r = 2, k = 256. We will
refer to this approximation as the liquid theory approximation. In this and all the other
results that we will present, the cut-off Λ was chosen as Λ = 3l and x was expanded to 2nd
order. The dashed lines in Fig. 3 are the Gaussian approximation of Kleffe and Borodovsky
(KB) |2n] with the distribution mean and variance given by Eqs. ()2.26|) and ()2.28|) . The 
dot-dashed lines are the compound Poisson (CP) approximation of Chrysaphinou and
Papastavridis [11], Geske et al. [12] and Schbath [17].
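Exact reference probabilities of the kind shown as circles in Fig. 3 can be generated in several ways. The sketch below is one standard possibility and not necessarily the method used for the figures: a Knuth-Morris-Pratt automaton tracks the current partial match, and a dynamic program propagates the joint probability of automaton state and match count. It assumes independent, uniformly distributed letters and counts overlapping occurrences; all function names are hypothetical.

```python
def kmp_automaton(word, alphabet):
    """Transition table: delta[state][letter] = (next_state, completed_match)."""
    l = len(word)
    # fail[i] = length of the longest proper border of word[:i].
    fail = [0] * (l + 1)
    k = 0
    for i in range(1, l):
        while k > 0 and word[i] != word[k]:
            k = fail[k]
        if word[i] == word[k]:
            k += 1
        fail[i + 1] = k
    delta = []
    for state in range(l):
        row = {}
        for a in alphabet:
            s = state
            while s > 0 and word[s] != a:
                s = fail[s]
            if word[s] == a:
                s += 1
            # A completed match falls back to the longest border (overlaps allowed).
            row[a] = (fail[l], True) if s == l else (s, False)
        delta.append(row)
    return delta


def match_count_distribution(word, k, alphabet):
    """Exact P(n occurrences of word in a uniform random string of length k)."""
    l, r = len(word), len(alphabet)
    delta = kmp_automaton(word, alphabet)
    max_n = max(k - l + 1, 0)
    prob = [[0.0] * (max_n + 1) for _ in range(l)]
    prob[0][0] = 1.0
    for _ in range(k):
        new = [[0.0] * (max_n + 1) for _ in range(l)]
        for s in range(l):
            for n, p in enumerate(prob[s]):
                if p == 0.0:
                    continue
                for a in alphabet:
                    t, hit = delta[s][a]
                    new[t][n + int(hit)] += p / r
        prob = new
    return [sum(prob[s][n] for s in range(l)) for n in range(max_n + 1)]
```

A call such as `match_count_distribution("1010", 256, "01")` returns the list of probabilities p(n) for n = 0, ..., k − l + 1; its mean equals (k − l + 1)/r^l by linearity of expectation, which provides a quick consistency check.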

The variation between actual and approximate distributions, p(n) and p̃(n), can be
quantified by the total variational distance between the two distributions, which is
defined as

    d_{TV}(p, \tilde{p}) = \frac{1}{2} \sum_n \left| p(n) - \tilde{p}(n) \right|.    (3.55)

Table 3 shows the variational distances between the actual and approximate distributions
depicted in Fig. 3 (l = 4 and k = 256).
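Eq. (3.55) is straightforward to evaluate once both distributions are tabulated. The short Python sketch below also includes the rescaling by an overall constant that distinguishes the (NL) from the (L) variant; the function names are hypothetical, and distributions of unequal length are padded with zeros as an assumption.

```python
def total_variation(p, q):
    """Total variational distance, Eq. (3.55): half the L1 distance."""
    size = max(len(p), len(q))
    p = list(p) + [0.0] * (size - len(p))
    q = list(q) + [0.0] * (size - len(q))
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def normalize(q):
    """Rescale an approximate distribution by an overall constant so it sums to 1."""
    total = sum(q)
    return [v / total for v in q]
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, which sets the scale for the values reported in the tables.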

We see that the (un-normalized) liquid theory approximation, Eq. (3.54) (L), as well
as the liquid theory approximation normalized by an overall constant (NL), perform
better than the compound Poisson (CP) and Gaussian (KB) approximations. Note that for
c = 000, none of the approximations captures the height of the peak of the distribution
accurately and we will remark on this shortly.

    c      d_TV^L    d_TV^NL   d_TV^CP   d_TV^KB
    000    0.052     0.053     0.189     0.052
    001    0.035     0.031     0.079     0.075
    010    0.011     0.003     0.108     0.071
    111    0.032     0.021     0.047     0.148

Table 3: Total variational distance between the actual distribution and the various
approximate distributions for the case r = 2, k = 256: liquid theory approximation (L),
Eq. (3.54), the liquid theory approximation normalized by an overall constant (NL), the
compound Poisson approximation (CP) and the Gaussian approximation (KB).

Tables 4 and 5 show the total variational distances between the actual and approximate
distributions for word lengths l = 3, 4, 5, 6, 7 and l = 8 and string lengths k chosen
such that k/r^l = 16, i.e. the distributions have approximately the same mean. Overall,
the liquid theory approximation, Eq. (3.54) (L), as well as the liquid theory
approximation normalized by an overall constant (NL), perform better than or as well as the
compound Poisson (CP) and Gaussian (KB) approximations taken by themselves. The CP
approximation gives a better approximation for c = 11⋯1 and for some of the low and
high χ equivalence classes associated with l = 7 and l = 8. Also note that for l > 6 the
CP approximation performs generally better than the KB approximation, as was noted
before by Robin and Schbath. The poor performance of the liquid theory approximation
for the case l = 3 and c = 00 turns out to be due to the fact that the expansion of
x and b_m to second order is not adequate. Upon calculating x (and b_m) more accurately,
the agreement with the actual distributions turns out to be nearly perfect.

Regarding the robustness of the liquid theory approximations (L) and (NL), we have
checked that going to a higher cut-off does not improve the distributions very much.
Also, it turns out that for large χ and l, the first order expression for x is often sufficient;
however, it is almost always insufficient for small χ and in particular when χ = 1, i.e. x
belongs to the equivalence class c = 11…1.

Fig. 6 shows the n-match distributions for l = 4 and with a string length that has
been increased to k = 4096. Comparing with the case k = 256, Fig. 3, the distributions
for small χ are more symmetric around their mean. The total variational distances are
given in the table below. Note that they are comparable with the values that we obtained
for k = 256, Table 3.

The discrepancy between actual and approximate distributions for c = 000 is
persistent: it does not improve with increasing Λ, or going to third order in the expansion of
x, or by taking the stationary phase approximation to higher order (which turns out to
be a 1/n expansion). The discrepancy for c = 000 does not seem to be a finite-size effect
either, as can be seen by comparing Figs. 3 and 6.

On the other hand, increasing r does reduce the total variations. Fig. 7 shows the
n-match distribution for l = 4, m = 4092 and strings whose letters come from a 4
letter alphabet. Notice that the total variations of the approximate distributions are
overall much smaller and all three approximations yield similar results. In particular the
    c        d_TV^L    d_TV^NL   d_TV^CP   d_TV^KB
    00                 (0.933)   0.227     0.006
    01       0.027     0.008     0.156     0.084
    11       0.018     0.016     0.121     0.131
    000      0.052     0.053     0.189     0.052
    001      0.035     0.031     0.079     0.075
    010      0.011     0.003     0.108     0.071
    111      0.032     0.021     0.047     0.148
    0000     0.009     0.010     0.090     0.018
    0001     0.018     0.016     0.056     0.043
    0010     0.010     0.008     0.061     0.050
    0011     0.040     0.036     0.034     0.089
    0101     0.021     0.024     0.075     0.056
    1111     0.044     0.026     0.012     0.154
    00000    0.013     0.011     0.034     0.028
    00001    0.006     0.004     0.040     0.030
    00010    0.009     0.011     0.053     0.028
    00011    0.018     0.019     0.061     0.028
    00100    0.013     0.011     0.032     0.053
    00101    0.010     0.006     0.037     0.055
    01010    0.019     0.011     0.042     0.066
    11111    0.049     0.027     0.011     0.152

Table 4: Total variational distance between the actual distribution and the various
approximate distributions for the case r = 2 and (l, k) = (3, 128), (4, 256), (5, 512), (6, 1024),
(7, 2048) and (8, 4096): liquid theory approximation (L), Eq. (3.54), the liquid theory
approximation normalized by an overall constant (NL), the compound Poisson
approximation (CP) and the Gaussian approximation (KB).
    c          d_TV^L    d_TV^NL   d_TV^CP   d_TV^KB
    000000     0.025     0.024     0.004     0.037
    000001     0.003     0.003     0.028     0.028
    000010     0.004     0.002     0.023     0.031
    000011     0.005     0.006     0.031     0.031
    000100     0.004     0.002     0.023     0.038
    000101     0.004     0.003     0.029     0.037
    000111     0.004     0.003     0.029     0.042
    001001     0.011     0.012     0.033     0.041
    010101     0.023     0.013     0.026     0.067
    111111     0.052     0.022     0.015     0.146
    0000000    0.023     0.022     0.009     0.040
    0000001    0.022     0.021     0.006     0.040
    0000010    0.004     0.002     0.013     0.031
    0000011    0.018     0.017     0.004     0.041
    0000100    0.003     0.002     0.014     0.034
    0000101    0.010     0.008     0.007     0.039
    0000111    0.003     0.002     0.014     0.037
    0001000    0.003     0.003     0.017     0.036
    0001001    0.005     0.003     0.012     0.040
    0010010    0.005     0.005     0.017     0.046
    0010011    0.010     0.007     0.007     0.055
    0101010    0.025     0.009     0.012     0.072
    1111111    0.054     0.020     0.015     0.140

Table 5: Total variational distance between the actual distribution and the various
approximate distributions for the case r = 2 and (l, k) values of (7, 2048) and (8, 4096):
liquid theory approximation (L), Eq. (3.54), the liquid theory approximation normalized
by an overall constant (NL), the compound Poisson approximation (CP) and the Gaussian
approximation (KB).



    c      d_TV^L    d_TV^NL   d_TV^CP   d_TV^KB
    000    0.061     0.060     0.197     0.060
    001    0.035     0.035     0.076     0.075
    010    0.011     0.004     0.108     0.065
    111    0.045     0.023     0.038     0.140

Table 6: Total variational distance between the actual distribution and the various
approximate distributions for the case r = 2, l = 4 and k = 4096: liquid theory approximation
(L), Eq. (3.54), the liquid theory approximation normalized by an overall constant (NL),
the compound Poisson approximation (CP) and the Gaussian approximation (KB).
Figure 6: The n-match distribution for matching an l = 4 letter binary string x inside a
random string of length k = 4096, for x = 0001 (top left), x = 1001 (top right), x = 1010
(bottom left) and x = 1111 (bottom right). The circles are the exact probabilities,
the dashed and dashed-dotted lines correspond to the Gaussian and compound Poisson
approximations (see text for details). The solid line is the analytical result, Eq. (3.54),
normalized by an overall constant.

deviations for c = 000 have disappeared now. Table 7 gives the corresponding variational
distances. Comparing with Table 3, we see indeed that for r = 4 the total variational
distances are overall smaller.

It seems that for the case c = 000 and r = 2, the stationary phase approximation 
around the single point m 1 is not capturing all the contributions to the probability 
    c      d_TV^L    d_TV^NL   d_TV^CP   d_TV^KB
    000    0.008     0.008     0.016     0.030
    001    0.005     0.003     0.004     0.034
    010    0.006     0.007     0.014     0.036
    111    0.028     0.013     0.005     0.080

Table 7: Total variational distance between the actual distribution and the various
approximate distributions for the case r = 4, l = 4 and k = 4096: liquid theory approximation
(L), Eq. (3.54), the liquid theory approximation normalized by an overall constant (NL),
the compound Poisson approximation (CP) and the Gaussian approximation (KB).



distribution. 



Finally, we would like to remark that the expansion of x, Eq. (3.52), is in fact the
virial expansion of the equation of state for the (discrete) lattice gas. The parameter x
is related to z as x = 1/z − 1, Eq. (3.46). In the continuous 1d gas of n particles in a
"volume" L and nearest-neighbor interactions, the partition function can be written as

    Q(n, L) = \frac{1}{2\pi i} \int ds\, e^{sL} D^2(s) H^{n-1}(s),    (3.56)

where D(s) and H(s) are the Laplace transforms of the Boltzmann factor for the
particle-boundary and particle-particle interactions, and Eq. (3.56) is the inverse Laplace
transform with an appropriately chosen contour. For physical interactions and in the
thermodynamic limit, it turns out that the integral in the above equation can be evaluated
by a saddle point expansion around the point s_0, and as a result it turns out that
s_0 = βP, where β is the inverse temperature and P is the pressure [31]. Comparing
with Eq. (3.39) we see that upon discretizing the length of the container by letting
L = mΔ, and assuming that the interactions vary slowly with respect to Δ, Eq. (3.39)
can be recovered under the identification

    e^{-s_0 \Delta} = u = \frac{1}{1 + x},    (3.57)

which for small Δ implies that x = s_0 Δ = βPΔ. We thus see that the virial expansion
Eq. (3.52) leads to a van der Waals type equation of state [35]. Indeed, as can be seen from
Eq. (3.48), e_1 is the effective hard-core size and the term (n − 1)e_1 is the total excluded
"volume" due to the interaction (core + tail).

Fig. 8 shows the "P−V isotherms" of the lattice gas with l = 4, r = 2 and fixed
particle number n = 15 for the four equivalence classes c = 000, 001, 010 and 111 (from
top to bottom). The thick solid line is the "ideal gas" law x = n/m. The data points
have been obtained from numerically solving Eq. (3.45). Using the approximate equation
of state, Eq. (3.47), gives similar results but with increasing deviations at high densities.

Figure 7: The n-match distribution for matching an l = 4 letter 4-ary string x inside a
random string of length k = 4096, for x = 0001 (top left), x = 1001 (top right), x = 1010
(bottom left) and x = 1111 (bottom right). The circles are the exact probabilities,
the dashed and dashed-dotted lines correspond to the Gaussian and compound Poisson
approximations (see text for details). The solid line is the analytical result, Eq. (3.54).

Figure 8: The "P−V diagram" of the lattice gas with l = 4, r = 2 and fixed particle
number n = 15 for the four possible interactions c = 000, 001, 010 and 111 (from top to
bottom). The thick solid line corresponds to the "ideal gas" law x = n/m (refer to text
for details).

3.5 Asymptotics 

We now consider the asymptotic form of the n-match distributions in the limit that the
length k = m + l of the random string is large. It turns out that this is most readily done
using generating functions. We define the generating function p(ζ; m, c) of p(n; m, c) as

    p(\zeta; m, c) = \sum_{n=0}^{\infty} p(n; m, c)\, \zeta^n.    (3.58)

From Eq. (3.21) we thus find that

A A °° 1 /" 1 

MC;m,c) = -ir + ^E(C^''^)"^f d^-;^ D\z)H-\z), (3.59) 

^1 ^1 „_i JdD Z 



where we have used the asymptotic form p(0; m, c) = A_1/z_1^{m+1} for the n = 0 term, since
m is assumed to be large. The order of summation and integration can be exchanged if
the integrand is uniformly convergent in the region of integration. It is not hard to show
that this can be achieved for example by a circular path |z| = R, with a suitably chosen
R < 1. Thus carrying out the sum first, we obtain

Substituting the approximate forms for D(z) and H(z), Eqs. (3.37) and (3.36), we find



p iC;i,m, c) 
+ 



m+l 

Z\ 



Ce^/^ / dz 1 {z^ + (1 - ^) (^a(^) + r(^))]' 



+1 27ri Jqj^ ^"^+1 1 - z (1 - [1 - Ce^^ {iik{z) + r(z))] - (e^^z^ ' 

(3.61) 



Denote the expression in the denominator by \{^z; (, c), 

Mz; C, c) = (1 - z) [1 - Ce^^ (HAiz) + Viz))] - (e^^z^. (3.62) 

Since exp(βμ) is of order 1/r^l, it follows that λ(z; ζ, c) has a root near z = 1. It turns
out again that this is the root closest to the origin and that all other roots are of order
|z|^Λ ζ exp(βμ) ∼ 1. Denoting the root of smallest magnitude by z̄_1, and using the method
of Section 2.6, a series expansion of z̄_1 can be made. One finds to lowest order that

l-Ce/^^ifA(l)-Ce/5/^r(l)- ^^-^^^ 

The integrand in Eq. ()3.6ip has therefore two dominant poles at z = 1 and z = Zi. 
For large m, the contour integral can again be evaluated approximately by pushing the 
countour out to infinity and keeping only the residues from the dominant poles (which 
are traversed counter-clockwise), as explained in Section EEl 

We find 



[{l--z^){D^{-z^)+T{-z,)) + 4]\ 



{z^z,r^'l-h V A'((z;C,c) 

(3.64) 

Notice that the m dependence is entirely confined to the term l/(zi^i)™+^. Thus this 
term alone is responsible for the large m behavior. The term in the square brackets is 
the effect due to the boundaries of the string. When m is large boundary effects should 
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not matter and we will set this term to 1. Alternatively, we can assume that the random 
string is circular and in this case the boundary term will not arise. 

Apart from the cut-off assumption on the behavior of the tails, and the assumption 
of large m leading to the m-asymptotic expression, Eq. ()3.64j) . we have not made any 
assumptions on r or Z so far. To proceed further, we will assume that -C 1 so that 
the lowest order expressions for zi and zi will provide the leading order approximation to 

Eq. (uni. 

Substituting the lowest order expression for Zi, Eq. (j3.63j) and noting that to this order 
—X'{{z;(,c) = 1 — (exp{Pfi)H\{l) — (exp{Pfi)T{l), the result simplifies to 



A, 



I - \m+l 
[ZlZx) 



(3.65) 



The compound Poisson (CP) distribution arises in the limit when $m \to \infty$ and $\langle n \rangle$ is finite.
From Eq. (2.26) this implies that the word length l scales as $l \sim \log_r (m + 1)$. From the
properties of the interactions that were derived in Section 3.2 we see that the tails are
very weak and of order 1/m, while the core is relatively strong and of order log m. Thus it
is permissible to set $\Lambda = l$ and ignore the tails ($\Gamma = 0$). Note that in this limit $1/r^{l} \sim 1/m$
and thus to lowest order $A_1 = 1$, and

\[ r^{l} e^{\beta\mu} = \frac{1}{\left[ 1 + c(1/r) \right]^{2}}. \tag{3.66} \]



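We thus obtain

\[ \hat{p}(\zeta; l, m, c) = \left[ 1 + \frac{1}{r^{l}} \frac{1}{1 + c(1/r)} - \frac{\zeta}{r^{l}} \frac{1}{\left[ 1 + c(1/r) \right]^{2}} \frac{1}{1 - \zeta e^{\beta\mu} H_{l}(1)} \right]^{-(m+1)}. \tag{3.67} \]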



Further simplifications occur, noting that from Eq. (2.37) to order $1/r^{l}$ we have

\[ \frac{1}{1 + c(1/r)} = 1 - h(1/r; c), \tag{3.68} \]

while from Eq. (3.26) we find that

\[ e^{\beta\mu} H_{l}(1) = h(1/r; c). \tag{3.69} \]



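Multiplying out the product in Eq. (3.67) and keeping only terms to order $1/r^{l} \sim 1/m$,
we thus obtain

\[ \hat{p}(\zeta; l, m, c) = \left[ 1 + \frac{1}{r^{l}} \left[ 1 - h(1/r; c) \right] - \frac{\zeta}{r^{l}} \frac{\left[ 1 - h(1/r; c) \right]^{2}}{1 - \zeta h(1/r; c)} \right]^{-(m+1)}. \tag{3.70} \]

Taking now the limit $m \to \infty$ such that $(m + 1)/r^{l} = \langle n \rangle$ is finite, the expression is readily
brought to the form

\[ \hat{p}(\zeta; l, m, c) = \exp\left( \sum_{j=1}^{\infty} \lambda_j \left( \zeta^{j} - 1 \right) \right), \tag{3.71} \]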



M. Mungan, String Matching and 1d Lattice Gases (DRAFT, August 25, 2005)



with

\[ \lambda_j = \langle n \rangle \left[ 1 - h(1/r; c) \right]^{2} h^{j-1}(1/r; c). \tag{3.72} \]

Eq. (3.71) is the generating function of a compound Poisson distribution and precisely
the result derived by various other methods by Chrysaphinou and Papastavridis [11],
Geske et al. [12], and Schbath [17] in the special case of uniformly distributed i.i.d. letters.
Also note that the CP distribution is normalized, $\hat{p}(1; l, m, c) = 1$.
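The compound Poisson form can be turned into explicit probabilities with the standard recursion for compound Poisson laws, n p(n) = sum_j j lambda_j p(n - j). The sketch below assumes that the rates read lambda_j = <n> (1 - h)^2 h^(j-1), and that h(1/r; c) follows from the bit-vector through the lowest-order relation 1 - h = 1/(1 + c(1/r)); both readings, and the function names, are assumptions made for illustration.

```python
import math


def h_from_bitvector(bits, r):
    """h(1/r; c) from the assumed relation 1 - h = 1 / (1 + c(1/r)).

    `bits` holds c_1, ..., c_{l-1}; c(1/r) = sum_b c_b / r**b.
    """
    c = sum(cb / r ** (b + 1) for b, cb in enumerate(bits))
    return c / (1.0 + c)


def compound_poisson_pmf(mean_n, h, n_max):
    """p(0..n_max) for rates lambda_j = mean_n * (1 - h)**2 * h**(j - 1)."""
    lam = [0.0] + [mean_n * (1 - h) ** 2 * h ** (j - 1) for j in range(1, n_max + 1)]
    # p(0) = exp(-sum_j lambda_j), and sum_j lambda_j = mean_n * (1 - h)
    p = [math.exp(-mean_n * (1 - h))]
    for n in range(1, n_max + 1):
        # Standard compound Poisson recursion: n p(n) = sum_j j lambda_j p(n - j)
        p.append(sum(j * lam[j] * p[n - j] for j in range(1, n + 1)) / n)
    return p
```

For h = 0 (no self-overlap) the rates reduce to lambda_1 = <n> and the result is an ordinary Poisson distribution; for h > 0 matches arrive in geometrically sized clumps while the mean stays at <n>.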

Note that setting the tails {b > I) of the interactions to zero means that given the 
next match is a distance at least / away, it can occur with equal probability at any b > I. 
Since nearest neighbor match separations b < I define an overlapping cluster, this means 
that the location of the clusters themselves, b > I, are distributed like the arrivals of a 
poisson process [TTl UHl UHl • We therefore see that the liquid theory description in terms of 
interactions, along with the separation of cores and tails, provides an alternative and very
simple explanation of this property. Conversely, strong tails mean that the positions of
the clusters themselves are correlated and deviate from a Poisson process (meaning that
the probability of initiating a new cluster depends on the distance from the last cluster).
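The cluster picture can be explored directly on sampled strings. The following sketch locates all (overlapping) matches of a word, groups them into clusters by starting a new cluster whenever the separation from the previous match is at least the word length, and provides a helper for drawing uniform random strings. The treatment of the boundary case b = l as the start of a new cluster and all names are assumptions of this illustration.

```python
import random


def match_positions(text, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    return [a for a in range(len(text) - len(word) + 1) if text.startswith(word, a)]


def cluster_starts(positions, l):
    """First match position of each cluster.

    Consecutive matches separated by b < l overlap and belong to the same
    cluster; a separation b >= l starts a new cluster (assumed convention).
    """
    starts, last = [], None
    for a in positions:
        if last is None or a - last >= l:
            starts.append(a)
        last = a
    return starts


def random_text(alphabet, k, rng):
    """Uniform i.i.d. random string of length k drawn with the generator rng."""
    return "".join(rng.choice(alphabet) for _ in range(k))
```

Collecting the gaps between successive entries of `cluster_starts` over many samples of `random_text` gives an empirical check of how closely the cluster locations follow a Poisson process for a given word and alphabet.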

We now consider the limit $m \to \infty$ and $n \to \infty$ such that in this limit the number
density $n/(m + 1) = 1/r^{l}$ remains constant and is small. In this limit the tails of the
interaction are also small, and we obtain (to lowest order in $1/r^{l}$)

\[ \hat{p}(\zeta; l, m, c) = \left[ 1 + \frac{1}{r^{l}} \frac{1}{1 + c(1/r)} - \frac{\zeta}{r^{l}} \frac{1}{\left[ 1 + c(1/r) \right]^{2}} \frac{1}{1 - \zeta e^{\beta\mu} H_{l}(1) - \zeta e^{\beta\mu} \Gamma(1)} \right]^{-(m+1)}. \tag{3.73} \]



Notice that if $\Lambda = \infty$, there would be nothing left for the remaining tail and thus $\Gamma$ would
be zero and we would obtain

\[ \hat{p}(\zeta; \infty, m, c) = \left[ 1 + \frac{1}{r^{l}} \frac{1}{1 + c(1/r)} - \frac{\zeta}{r^{l}} \frac{1}{\left[ 1 + c(1/r) \right]^{2}} \frac{1}{1 - \zeta h(1/r; c)} \right]^{-(m+1)}. \tag{3.74} \]

The normalization is given by $\hat{p}(1; \infty, m, c) = 1$, and using the relation Eq. (3.68) it is
readily seen that the distribution is normalized to order $1/r^{l}$. This observation immediately
gives us a way to estimate $\Gamma(1)$, which must be chosen such that the distribution is
normalized to that order. We have



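\[ -e^{\beta\mu} \Gamma(1) = e^{\beta\mu} H_{l}(1) - h(1/r; c), \tag{3.75} \]

and the normalized distribution becomes

\[ \hat{p}(\zeta; m, c) = \left[ 1 + \frac{1}{r^{l}} \left[ 1 - h(1/r; c) \right] - \frac{\zeta}{r^{l}} \frac{\left[ 1 - h(1/r; c) \right]^{2}}{1 - \zeta h(1/r; c)} \right]^{-(m+1)}. \tag{3.76} \]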
The large n limit can again be obtained using Hayman's method introduced in the previous
sub-section. Choosing $\zeta_0$ such that

\[ n = \left[ \zeta\, \partial_{\zeta} \ln \hat{p}(\zeta; m, c) \right]_{\zeta = \zeta_0}, \tag{3.77} \]

we find to order $1 - \langle n \rangle / n$



C. . 1 4 (l - M) . (3.78) 

2 1 + h{l/r; c) \ n J 

where $\langle n \rangle$ is as defined in Eq. (2.26). Using this approximation for $\zeta_0$, we find after a
little bit of algebra that n is Gaussian distributed around its mean,

\[ p(n; m, c) \sim \exp\left( -\frac{(n - \langle n \rangle)^{2}}{2\sigma^{2}} \right), \tag{3.79} \]

with 



In concluding this section we would like to point out that our derivation of the CP and
Gaussian asymptotic forms rests on determining the dominant root of $\lambda(z; \zeta, c)$, Eq. (3.62),
which in turn emerges as a result of introducing a cut-off $\Lambda$ and approximating the
interactions beyond $\Lambda$. In a sense, it is the presence of the cut-off that simplifies the analytical
treatment of the problem, since it makes explicit the separation of small and therefore
negligible terms from the dominant ones.



4 The case of general random letter strings 

All the calculations and results presented so far have been worked out for the case of
uniformly distributed i.i.d. letters of the random string. However, for many applications this
requirement is too restrictive. Letter distributions that have been considered in the
literature are non-uniform i.i.d. letters and letter sequences generated by a Markov process.
For either of these cases, asymptotic results in the form of large deviations, Gaussian and
compound poisson distributions exist [TT | IT ^ ITHllT H ITH t lT7 | lil[TH t lT!)j. 

In this section we show that the n-match probability associated with a broader class of 
letter distributions can be worked out using the lattice gas description introduced in the 
previous section. The essential insights gained from this approach are not changed by this 
generalization. The problem to be solved is still that of calculating the partition function 
of a 1d lattice gas of n particles with nearest-neighbor interactions among themselves and
the boundaries. The only difference is that the interactions and hence the calculations 
become more involved. 

The required generating functions have already been derived by Regnier and Szpankowski
[14] and we will adapt their results to our notation. Let again $y = (y_1, y_2, \ldots, y_k)$
be the letters of the random string and let $x = (x_1, x_2, \ldots, x_l)$ be the word to be matched.

Regnier and Szpankowski consider the case of i.i.d. letters with arbitrary letter
distribution (Bernoulli Model) and letter sequences generated by a one-step Markov process
with transition matrix P, such that $P_{ij}$ is the transition probability $P\{y_{a+1} = i \mid y_a = j\}$,
$\pi = (\pi_1, \pi_2, \ldots, \pi_r)$ is the stationary letter distribution satisfying $\pi P = \pi$, and the
stationary matrix $\Pi$ is the matrix whose r rows are $\pi$ (Markov Model).
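For the Markov Model, the stationary distribution and the probability p(x) of a word can be obtained numerically as in the sketch below. It assumes the row-stochastic convention, in which `P[i][j]` is the probability of moving from letter i to letter j, so that the stationary row vector satisfies pi P = pi; if the opposite index convention is intended, the matrix has to be transposed first. Function names are illustrative.

```python
def stationary_distribution(P, tol=1e-14, max_iter=100_000):
    """Stationary row vector pi with pi P = pi, by power iteration.

    Assumes P is row-stochastic (each row sums to 1) and the chain is
    irreducible and aperiodic, so that the iteration converges.
    """
    r = len(P)
    pi = [1.0 / r] * r
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(r)) for j in range(r)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("power iteration did not converge")


def word_probability(word, alphabet, P, pi):
    """p(x): stationary probability of the first letter times the
    transition probabilities along the word."""
    idx = {ch: i for i, ch in enumerate(alphabet)}
    p = pi[idx[word[0]]]
    for a, b in zip(word, word[1:]):
        p *= P[idx[a]][idx[b]]
    return p
```

For a two-letter alphabet with `P = [[0.9, 0.1], [0.5, 0.5]]` the stationary distribution is (5/6, 1/6), which follows from the balance condition 0.1 pi_a = 0.5 pi_b.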

Given any subsequence of letters $y_{a+1}, y_{a+2}, \ldots, y_{a+u}$, denote by $p(y_{a,u})$ the probability
of encountering $y_{a,u}$, without any conditions on the letters preceding or following it.
Likewise, denote by p(x) the probability of generating the word x. The generating function
of the n-match probability is given by



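\[ p(n; z, c) = p(x)\, d^{2}(z; c)\, h^{n-1}(z; c), \tag{4.1} \]

with

\[ d(z; c) = \frac{1 - h(z; c)}{1 - z} = \frac{1}{p(x)} \frac{1}{\lambda(z; c)}, \tag{4.2} \]

and

\[ \lambda(z; c) = z^{l} + \frac{1 - z}{p(x)} \left[ 1 + c(z) + \frac{p(x)}{\pi(x_1)}\, T(z)\, z^{l} \right]. \tag{4.4} \]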

In the last equation $\pi(x_1)$ is the steady-state probability of encountering the letter $x_1$ and
T(z) is the generating function for the steady-state transition probability from the end of
one word match to the beginning of the next word match as a function of the gap length
between the two words (for the Bernoulli Model T(z) = 0). The generating function c(z)
is defined as

6=1 

where $c_b(x)$ are the bit-vectors associated with the word x. Note that an overall factor of
$z^{l}$ in the definition of p(n; z, c) in [14] is absent, since the generating function p(n; z, c),
as defined above, corresponds in our case to p(n; m, c), where m = k - l is the effective
length of the string.
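The bit-vector of a word and the associated overlap weights are straightforward to compute. The sketch below takes c_b = 1 when the word shifted by b letters agrees with itself on the overlap (b = 1, ..., l - 1), and, for the Bernoulli Model, weights each bit by the probability of the b letters that complete the overlapping match. That weighting follows the usual autocorrelation polynomial of the string matching literature and is assumed, not confirmed, to coincide with the coefficients of c(z) used here; function names are illustrative.

```python
def bit_vector(word):
    """c_b for b = 1..l-1: 1 if word shifted by b matches itself on the overlap."""
    l = len(word)
    return [1 if word[b:] == word[:l - b] else 0 for b in range(1, l)]


def fundamental_period(word):
    """Smallest shift b with c_b = 1, or the word length if there is none."""
    for b, cb in enumerate(bit_vector(word), start=1):
        if cb:
            return b
    return len(word)


def overlap_weights(word, letter_prob):
    """Coefficients of z**b (b = 1..l-1) for i.i.d. letters: c_b times the
    probability of the last b letters of the word (assumed convention)."""
    coeffs = []
    for b, cb in enumerate(bit_vector(word), start=1):
        w = 1.0
        for ch in word[len(word) - b:]:
            w *= letter_prob[ch]
        coeffs.append(cb * w)
    return coeffs
```

For binary words of length four, `bit_vector` returns 000 for "aaab", 001 for "abba", 010 for "abab" and 111 for "aaaa". With uniform letters the weight of bit b reduces to c_b / r^b.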

Comparing with the corresponding equations of the uniformly distributed random
letter case, Eqs. (2.42), (2.38), (2.39) and (2.36), we see that the form of the equations as
well as the relationships between the generating functions are identical.

In particular, all recursions can be recovered by making the replacements $h_a / r^{a} \to h_a$,
$d_a / r^{a} \to d_a$, and $c_a / r^{a} \to c_a$, so that $h(z/r) \to h(z)$ etc.

The Markov property introduces the additional complication that one has to propagate
the end of one word match at $a_j$ to the beginning of the next match at $a_{j+1}$ through the
$(a_{j+1} - a_j - l)$-step steady-state transition probability.

Regnier and Szpankowski have also proven that the polynomial $\lambda(z; c)$ has at least one
real root and that all roots have $|z| > 1$, as in the case of uniform letter distributions.
The asymptotic behavior of h and d is again due to the root closest to z = 1.
As can be seen from Eq. (4.4), for p(x) small, the root closest to z = 1 is located at
roughly

\[ z_1 \approx 1 + \frac{p(x)}{1 + c(1)}, \tag{4.6} \]

and all other roots are roughly located at $|z| \sim 1/p(x)$. Recall that in the case of
uniformly distributed letters, $p(x) = 1/r^{l} < 1/2$. For the general letter distributions,
both the distribution as well as the word x can be chosen arbitrarily and thus there is no
constraint on the values that $0 < p(x) < 1$ can take. This means in particular that there
is a broader class of possible interactions.
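The location of the dominant root can be checked numerically for i.i.d. letters. The sketch below uses the standard Bernoulli-model polynomial (1 - z)(1 + c(z)) + p(x) z^l, which is assumed here to agree with the polynomial of the text up to an overall factor when T(z) = 0, and finds its root in the interval [1, 1.5] by bisection. The bracket, the function names and the weighting of the overlap terms are assumptions of this illustration.

```python
def lam_bernoulli(z, word, letter_prob):
    """(1 - z) * (1 + c(z)) + p(x) * z**l for i.i.d. letters."""
    l = len(word)
    p_word = 1.0
    for ch in word:
        p_word *= letter_prob[ch]
    c = 0.0
    for b in range(1, l):
        if word[b:] == word[:l - b]:      # self-overlap at shift b
            w = 1.0
            for ch in word[l - b:]:       # letters completing the overlap
                w *= letter_prob[ch]
            c += w * z ** b
    return (1.0 - z) * (1.0 + c) + p_word * z ** l


def root_near_one(word, letter_prob, hi=1.5, tol=1e-13):
    """Bisection for the root of lam_bernoulli in [1, hi]."""
    lo = 1.0
    f_lo = lam_bernoulli(lo, word, letter_prob)
    if f_lo * lam_bernoulli(hi, word, letter_prob) > 0:
        raise ValueError("no sign change in the bracket [1, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = lam_bernoulli(mid, word, letter_prob)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)
```

Comparing `root_near_one` with the estimate 1 + p(x)/(1 + c(1)) for a few words gives a feeling for how quickly the lowest-order expression degrades as p(x) grows.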

Defining again the effective particle-particle interaction as

\[ e^{-\beta U(b)} = \frac{h(b)}{h_{asy}(b)}, \tag{4.7} \]

$h_{asy}(b)$ is readily worked out as

\[ h_{asy}(b) = \frac{A_1}{z_1^{b+1}} \frac{(z_1 - 1)^{2}}{p(x)}, \tag{4.8} \]

where $A_1 (z_1 - 1) = -1/\lambda'(z_1)$, cf. Eq. (2.51). We thus obtain for the particle-particle
interaction

\[ \beta U(b) = -\ln h(b) - b \ln z_1 + \ln p(x) + \beta U_0, \tag{4.9} \]

where

\[ \beta U_0 = \ln\left[ \frac{A_1}{z_1} \left( \frac{z_1 - 1}{p(x)} \right)^{2} \right]. \tag{4.10} \]

If $p(x) \ll 1$, $\beta U_0$ is a constant of order p(x), since the argument of the logarithm is of
order 1 to the same order.

In the core-region b < l, the non-zero values of h are still determined by the bit-vector
c associated with x and we find, analogous to Eq. (3.8), that



h{b) 




if X does not divide b, 

if & = X, 
otherwise. 



(4.11) 



where $\chi$ is the fundamental period associated with c that was defined at the end of Section
2.3, and by definition, h(0) = 0.

We see that the interaction is $+\infty$ whenever $c_b = 0$. This is certainly the case for
$b < \chi$. The interaction in the core-region is given by



pU{b) = -Incfe - Mn [V/'(xi,f,)] + Inp(x) + pUo, 



(4.12) 



Comparing the above with the uniform letter distribution case, Eq. (3.9), we see
similarities as well as differences: the argument of the logarithm in square brackets is no
longer necessarily smaller than one, but can depend on the subtle interplay of the overall
word matching probability p(x) (which determines $z_1$) with the (smaller) probabilities of
matching the subwords $p(x_{l-b})$. Thus such cases require more care. Assuming that p(x) is
sufficiently small so that the expression in square brackets is smaller than one, it is again
the case that the energy of the core region is of order ln p(x) and increases with b. Under
the same assumptions, the characteristic energy of the tail region can be worked out and
one finds that it goes like p(x), similarly to the line of reasoning in Section 3.2.

Thus we see that by suitably choosing the set of probabilities $p(x_{l-b})$, for b = 1, 2, ..., l,
the strengths of the core and tail of the interactions can be varied and can possibly move
the distribution functions into a regime where approximations ignoring the contributions
from the tails (such as the compound Poisson approximation) are inappropriate.

Also note that by letting p(x) be arbitrarily close to one, the difference of magnitudes
between the root closest to the origin, $z_1$, and the other roots can be made to vanish. Since
the attenuation length of the tail of the interactions depends on the separation of these
roots, we see that in this limit $z_1$ ceases to be the dominant root. For the interactions this
means an increasingly slowly decaying tail as the two roots approach each other. In
such a regime the tails of the interactions should become very important and thus cannot
be neglected.

It is possible that for certain distributions and choices of words, this can cause a
break-down of the liquid theory approach, which essentially is a perturbation theory, and
it would be interesting to find out if and how this can happen. Strong tails will certainly
affect the quality of approximations such as the compound Poisson distribution, which
was based on the assumption that tails can be ignored. This assumption turned out to be
equivalent to assuming a Poissonian distribution of cluster locations, as explained in Section 3.5.

We will further discuss these points in the Discussion section below. 

5 Discussion 

We have presented a new approach to calculating the probability distribution for the
number of matches of a given word inside a random string of letters. Our approach rests
on the observation that the exact expression for such a distribution can be interpreted as
the partition function of an n-particle system on a linear lattice, with pairwise nearest
neighbor interactions. By exploiting this analogy and focusing on the generic properties
of the interaction, we have been able to set up a virial expansion for the equation of state
of this lattice gas and thereby obtained an analytical expression for the n-match probability
distribution, which besides interpolating between the known asymptotic forms, also
provides a good approximation in the intermediate regimes.

The identification and subsequent analysis of the effective interactions in the lattice
gas description turns out to be key in our solution of this problem. The interactions are
characterized by a strong core-region of the size of the word-length followed by a relatively
weak and exponentially decaying tail. Although we have carried out the detailed analysis
for the special case of uniform letter distributions, we showed in Section 4 that our
method is readily extended to a broader class of distributions, such as non-uniform letter
distributions and random letter sequences generated by a Markov process. Regardless
of the underlying stochastic process for the random string, the generic features of the
interactions are still the same, namely a relatively strong core and a weaker tail, and our
approach should be readily applicable to these types of problems as well.

We should also point out that our method of approach bears some similarity with the
work of Regnier and Szpankowski [14], who also use generating functions in their approach
to this problem. Our approach is however distinct in at least one crucial point: in the
cited work, upon deriving the generating functions for the n-match distribution, Eqs. (4.1),
(4.2), (4.3) and (4.4), the authors perform a Laurent expansion of the generating function
around its dominant pole of order n + 1 at $z_1$. Such an expansion is asymptotic in the
interactions and runs the risk of capturing more accurately the tail of the interaction
rather than its core (or at least many terms must be kept in order to capture the core
part to a sufficient degree of accuracy ISOl)- The approximation scheme presented here 
precisely avoids this by introducing a cut-off distance $\Lambda$ and keeping the exact interaction
up to $\Lambda$, while approximating the interactions only beyond $\Lambda$. As we have shown, this
is easily done, since the structure of the core of the interaction (b < l) directly follows
from the overlap properties of the string to be matched. Our analysis also shows that
since the core part of the interaction is typically stronger than the exponentially decaying
tail, keeping the core is crucial in determining the global properties of the distribution.
Moreover, our approach allows us to understand approximations such as the compound
Poisson distribution as being applicable in a regime where the tails of the interaction can be
neglected and only the core is kept. This also highlights the relative importance of the
core part of the interaction with respect to its tail.

Lastly, we would like to remark that our treatment of interactions, by separating out 
its strong and short-ranged core from its weak tail, is actually not new. Interactions with 
a strong core and an exponentially decaying tail are known as Kac potentials, named after 
M. Kac, who along with co-workers studied one-dimensional particle systems with such 
interactions (continuum and lattice version) in considerable detail, as part of an effort to 
understand the liquid-gas transition in the context of the van der Waals equation of state 
[36, ^3 EH] (for an overview, see the review article by Hemmer and Lebowitz [SHI)- 

Such systems are interesting, since they lead to phase transitions in the limit when 
the characteristic decay length of the interaction tends to infinity jSHl EZl IHHI (for an 
overview of phase transitions in one dimension, see the review article by Griffiths [39]).
The similarity of such systems with the string matching problem is apparent, since one can
make the interactions decay as slowly as one wishes by choosing a suitable random letter
distribution and string x to be matched such that the dominant poles of the generating
function of the n-match distribution function become arbitrarily close to each other. It
would therefore be of interest to see whether the distribution functions in this regime
can be calculated using more sophisticated techniques, such as integral equations and
operator methods, which have been introduced particularly for the purpose of dealing
with such types of interactions [38].

Acknowledgments I would like to thank Ayşe Erzan for both initially bringing to
my attention the string matching problem and later pointing out the connection with
Kac potentials. This work was supported in part by the Nahide and Mustafa Saydan
Foundation and Tübitak, the Turkish Science and Technology Research Council.

References 

[1] R. S. Boyer and J. S. Moore, Comm. ACM 20 (1977) 762.

[2] D. E. Knuth, J. H. Morris and V. R. Pratt, SIAM J. Comput. 6 (1977) 323.

[3] P. Pevzner, M. Borodovsky and A. Mironov, J. Biomol. Struct. Dynamics 6 (1991) 1013.

[4] B. Prum, F. Rodolphe and E. Turckheim, J. Roy. Stat. Soc. Ser. B 57 (1995) 205.

[5] S. Karlin and S. F. Altschul, Proc. Natl. Acad. Sci. USA 90 (1993) 5873.

[6] A. Dembo and S. Karlin, Ann. Prob. 19 (1991) 1773.

[7] R. V. Sole and R. Pastor-Satorras, "Complex Networks in Genomics and Proteomics," in S. Bornholdt and H. G. Schuster, eds., Handbook of Graphs and Networks (Wiley-VCH Verlag, Berlin, 2002).

[8] L. J. Guibas and A. M. Odlyzko, SIAM J. Appl. Math. 35 (1978) 401.

[9] L. J. Guibas and A. M. Odlyzko, J. Comb. Theory 30A (1981) 19.

[10] L. J. Guibas and A. M. Odlyzko, J. Comb. Theory 30A (1981) 183.

[11] O. Chrysaphinou and S. Papastavridis, Probab. Th. Rel. Fields 79 (1988) 129.

[12] M. X. Geske, A. P. Godbole, A. A. Schaffner, A. M. Skolnick and G. L. Wallstrom, J. Appl. Prob. 32 (1995) 877.

[13] I. Fudos, E. Pitoura and W. Szpankowski, Inform. Process. Lett. 57 (1996) 307.

[14] M. Regnier and W. Szpankowski, Algorithmica 22 (1998) 631.

[15] L. Goldstein and M. S. Waterman, Bull. Math. Biol. 54 (1992) 785.

[16] M. S. Waterman, Introduction to Computational Biology (Chapman & Hall, Boca Raton, 1995).

[17] S. Schbath, ESAIM Prob. Stat. 1 (1995) 1.

[18] G. Reinert, S. Schbath and M. S. Waterman, J. Comp. Biol. 7 (2000) 1.

[19] S. Robin and S. Schbath, J. Comp. Biol. 8 (2001) 349.



[20] S. Kirkpatrick and B. Selman, Science 264 (1994) 1297.

[21] M. Mézard, G. Parisi and R. Zecchina, Science 297 (2002) 812.

[22] S. Mertens, M. Mézard and R. Zecchina, cs.CC/0309020.

[23] D. Achlioptas, A. Naor and Y. Peres, Nature 435 (2005) 759.

[24] Y. Fu and P. W. Anderson, J. Phys. 19A (1986) 1605.

[25] R. Monasson, R. Zecchina, S. Kirkpatrick, B. Selman and L. Troyansky, Nature 400 (1999) 137.

[26] M. Mungan, A. Kabakçıoğlu, D. Balcan and A. Erzan, submitted to J. Phys. A, Analytical Solution of a Stochastic Content Based Network Model, q-bio.MN/0406049.

[27] H. Harborth, Zeits. f. Reine Angew. Math. 271 (1974) 139.

[28] E. Rivals and S. Rahmann, Jour. Comb. Theory 104 (2003) 95.

[29] J. Kleffe and M. Borodovsky, Comp. Appl. Biosci. 8 (1992) 433.

[30] H. S. Wilf, generatingfunctionology (Academic Press, Boston, 1994).

[31] A. D. Barbour, L. Holst and S. Janson, Poisson Approximation (Oxford University Press, Oxford, 1992).

[32] W. Feller, An Introduction to Probability Theory and its Applications (Wiley, N.Y., 1971).

[33] F. Gürsey, Proc. Cambr. Phil. Soc. 46 (1950) 182.

[34] I. Z. Fisher, Statistical Theory of Liquids (University of Chicago Press, Chicago, 1964).

[35] G. E. Uhlenbeck and G. W. Ford, Lectures in Statistical Mechanics (American Mathematical Society, Providence, 1963); K. Huang, Statistical Mechanics (John Wiley and Sons, N.Y., 1987).

[36] M. Kac, Phys. Fluids 2 (1959) 8.

[37] M. Kac, G. E. Uhlenbeck and P. C. Hemmer, J. Math. Phys. 4 (1963) 216.

[38] P. C. Hemmer and J. L. Lebowitz, in Phase Transitions and Critical Phenomena, vol. 5B, edited by C. Domb and M. S. Green (Academic Press, London, 1976).

[39] R. B. Griffiths, in Phase Transitions and Critical Phenomena, vol. 1, edited by C. Domb and M. S. Green (Academic Press, London, 1972).



